Feature Request: Accelerate TensorFlow core on FPGA - How? #8820
Comments
This would be a great question for StackOverflow! I'm not aware of people who have talked about how to do this already, but that would be the best place to look. Thanks! |
Andrew,
It was more of a feature request than a question... Can you make sure that it makes it to someone at Google? Although they may have already thought about it, or be doing it behind the scenes... |
I'm interested in this too. Maybe it would be better to implement an IP core on an FPGA optimized for TensorFlow, rather than a generic OpenCL IP. |
I'm also interested, but from the aspect of creating an accelerator and then using the hooks in Tensorflow to access it. |
FYI, Xilinx has DNN accelerators now. Currently they are integrated with Caffe; I'm sure TensorFlow is on the roadmap.
http://www.datacenterknowledge.com/archives/2016/11/14/xilinx-unleashes-fpga-accelerator-stack-supporting-caffe-openstack/ |
Hi @aselle, Hugh Perkins has created Coriander, which can run NVIDIA® CUDA™ code on OpenCL 1.2 devices, including FPGAs. You might want to take a look if that suits your need to connect your deep learning software to OpenCL 1.2 devices. Kindly credit his name and his contribution if you plan to use his work. |
@aselle TensorFlow is always lauded for its portability. However, even if one were to use @viper7882's solution, they'd still be constrained to GPUs, traditional CPUs, and commonly used FPGAs with OpenCL vendor libraries. This is not true general portability. Would it be possible to add a custom device feature/interface that lets us add unique devices, treating the tensor forward and backward computations within the context as a giant black box that can be fed values via gRPC serialization? Or the ability to pass a custom compiled executable/binary and link it into the rest of the graph (not just a TensorFlow lambda func)? This would greatly help the applied deep learning community, so we can stop pulling values out of the runtime graph, processing them, and reinserting them into the computation graph. Here are just several use cases based on projects I'm working on now:
My basic understanding is that the device contexts solely serve to define the compile logic of the tensors in the context and to properly link the compiled nodes with the rest of the graph. I'd implement a PR myself, but I only have a good grasp on the gRPC serialization logic. Wish I understood the graph compilation portion better :( |
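No such black-box device interface exists today, but the closest existing hook is probably tf.py_func (TF 1.x), which lets arbitrary host code sit inside the graph. A minimal sketch, where accelerator_matmul is a hypothetical stand-in for a gRPC client talking to an external device:

```python
import numpy as np
import tensorflow as tf  # TF 1.x

def accelerator_matmul(a, b):
    # Hypothetical: serialize a and b, ship them to the accelerator
    # over gRPC, and return the result. NumPy stands in here.
    return np.matmul(a, b).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 128])
w = tf.Variable(tf.random_normal([128, 64]))

# TF treats the computation as an opaque op; values cross the
# boundary as fully materialized arrays.
y = tf.py_func(accelerator_matmul, [x, w], tf.float32)
y.set_shape([None, 64])  # py_func drops static shape information
```

The catch is exactly the backward pass: py_func has no gradient by default, so a true black box would also need a tf.RegisterGradient plus a gradient_override_map for the backward computation.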
Would it be feasible to swap out specific parts of TensorFlow and accelerate them, e.g. with an FPGA implementation? Like creating FPGA-accelerated ops as @kinsumliu suggests under #12538? We've built Hastlayer, which automatically transforms .NET programs into equivalent FPGA implementations. What we could thus do is see how implementing some TF features with Hastlayer on an FPGA would work. |
Google designed an ASIC for exactly this, called the Tensor Processing Unit (TPU). A fair amount of information on it has been publicly released.
You could use this as a starting point to design soft IP for an FPGA that supports a handful of TensorFlow ops you'd like to speed up, then write the TF device handlers and ops. You could potentially even design the software interface with Xilinx's OpenCL macros (and maybe even the IP) to make the TensorFlow ops simpler to implement, so you don't need to write HDL manually. There's also reconfigure.io (program an FPGA with Go). It sounds like Hastlayer, which @Piedone mentioned, does something similar as well. |
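To sketch what "write the TF device handlers and ops" looks like from the Python side (all names below are hypothetical; the kernel itself would be a C++ OpKernel wrapping the FPGA driver):

```python
import tensorflow as tf  # TF 1.x

# Hypothetical shared object built from a C++ OpKernel that talks to
# the FPGA; a REGISTER_OP("FpgaConv2D") on the C++ side yields the
# snake_cased Python wrapper used below.
fpga_ops = tf.load_op_library("libfpga_conv.so")

x = tf.placeholder(tf.float32, [1, 224, 224, 3])
f = tf.Variable(tf.random_normal([3, 3, 3, 32]))
y = fpga_ops.fpga_conv2d(x, f, strides=[1, 1, 1, 1], padding="SAME")
```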
Yes, Hastlayer works in a very similar way, @stevefox. Which part of TF do you think would be most interesting to swap out for an accelerated implementation? It seems to me that the most important piece of the TPU is the matrix multiplier, somewhat similar to what GPUs offer. |
I was able to get TensorFlow Lite compiled for the Zynq and ran a quick benchmark. I typed up my results here: https://github.com/nlbutts/tflite_zynq My next step was looking at accelerating the convolution operation, which is where the software spends most of its time when using Mobilenet. |
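For anyone wanting to reproduce that kind of measurement, a rough sketch of a TFLite timing loop (written against today's tf.lite Python API; the model path is a placeholder, and on a Zynq you would more likely use the C++ interpreter):

```python
import time
import numpy as np
import tensorflow as tf

# "mobilenet_v1.tflite" is a placeholder for any converted model.
interpreter = tf.lite.Interpreter(model_path="mobilenet_v1.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

frame = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # fake input frame
interpreter.set_tensor(inp["index"], frame)

runs = 50
start = time.time()
for _ in range(runs):
    interpreter.invoke()
print("ms per inference:", (time.time() - start) / runs * 1000)
```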
@nlbutts did you make any progress? Is anyone aware of a solution that interfaces Tensorflow with FPGAs? There are some old attempts to add new devices to Tensorflow, but I did not see anything about FPGAs in particular. |
To those interested, I would like to share some thoughts and some info. Full disclosure: I opened this feature request about 1.5 years ago, back when I was more clueless. Also, I have worked at Xilinx for the past 4 years. Just 3 months ago I started as a "Machine Learning Engineer", so now I am very aware of Xilinx's approach in this space.

Machine learning/computer vision is a very fast-paced market. It seems like every year there is a new state-of-the-art network architecture. Given that an FPGA design cycle can take 6 months to a year, especially when a company is lacking FPGA know-how, Xilinx provisioned a hardware and software team to design a "soft" neural network accelerator core. The core is meant to be general purpose, in that it can accelerate most network architectures. We are calling it "XDNN". The software is getting tagged as "xfDNN": Xilinx Fast DNN. We are so great at coming up with names!

The core is designed specifically for accelerating inference, and it takes advantage of fixed-point arithmetic to squeeze more compute into the FPGA fabric/DSPs. This means trained networks must undergo an offline quantization process to go from float32 to int8. Various research shows that this is an effective method to achieve faster inference with minimal loss in accuracy (1-2%).
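For readers new to the idea, here is a minimal sketch of symmetric per-tensor float32-to-int8 quantization in NumPy; the scale choice and rounding are illustrative assumptions, not the actual xfDNN quantizer:

```python
import numpy as np

def quantize_int8(w):
    """Toy symmetric per-tensor quantization: float32 -> int8."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # e.g. conv filter weights
q, s = quantize_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```

Production quantizers typically pick scales per layer (or per channel) from calibration data rather than from the raw weight range alone.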
The core accelerates convolutions, pooling (max/average), eltwise add, and ReLU. We run the majority of the network on the FPGA and leave the final fully connected or region layers to the CPU. (There was no gain in adding hardware support for these final layers, since FC layers are being used less and less, and region layers can be quite custom.)

Currently, we only support cloud ("Amazon AWS EC2 F1") and on-premise ("VCU1525"). However, we are working on bringing the core into Zynq UltraScale+ devices, as well as other cloud providers such as Nimbix.

Check out our repo if you are interested: https://github.com/Xilinx/ml-suite
Also, you can kick the tires directly at: https://aws.amazon.com/marketplace/pp/B077FM2JNS

Right now, the documentation on GitHub isn't great, but we are about to push a new release with much improved docs. There is an EA branch that shows a lot of our software. My job right now is to enhance our documentation and also add usability features (ease-of-use Python stuff). So feel free to complain to me; it is my job. We have some really nice Jupyter notebook tutorials coming. I can be reached at b.lozano.havoc@gmail.com or bryan.lozano@xilinx.com.

Now we come to framework support...

Initially, we did some experiments enabling Caffe to directly call our accelerator. We got it working, but it was clear that there would be a lot of work to add support for various networks, and to support this across Caffe/TensorFlow/MXNet. TensorFlow was still evolving when we started. We simply didn't have enough software people to support that.

What we settled on is a compiler frontend that takes your network definition and weights. The compiler does some optimization (layer merging), then generates a schedule of commands to accelerate a given network. This way, the majority of our software can be framework agnostic, but the compiler has to be "multilingual". The compiler reads Caffe prototxt or frozen TensorFlow graphs, for example.
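To make the compiler-frontend idea concrete, a generic sketch (not Xilinx's actual tooling; the file name is hypothetical) of reading a frozen TensorFlow graph and walking its nodes to spot fusable layer chains, using the TF 1.x API of the era:

```python
import tensorflow as tf  # TF 1.x

# "frozen_model.pb" is a placeholder for any frozen GraphDef.
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# A frontend would walk nodes like this to find, say, Conv2D -> BiasAdd -> Relu
# chains that can be merged into a single accelerator command.
accelerated = {"Conv2D", "MaxPool", "AvgPool", "Add", "Relu"}
for node in graph_def.node:
    if node.op in accelerated:
        print(node.op, node.name, list(node.input))
```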
I would still love it if TensorFlow had hooks to directly run on the FPGA, but I don't see it happening anytime soon. It probably wouldn't be that bad, especially if TensorFlow can make OpenCL calls.

Anyways, if you want more info, don't hesitate to reach out. Again, I can be reached at b.lozano.havoc@gmail.com or bryan.lozano@xilinx.com.

Thanks,
Bryan |
I have now added an FPGA device to TensorFlow, and I can call the FPGA device's kernel to perform an addition operation. For the moment, though, the FPGA device is only a virtual device; I have not yet implemented the API to collect the FPGA's info, but I will do that later. |
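For context, once a device is registered with the runtime, pinning ops to it looks the same as for GPUs. A sketch, assuming a (hypothetical) device registered under the name "FPGA":

```python
import tensorflow as tf  # TF 1.x

with tf.device("/device:FPGA:0"):  # assumes an "FPGA" device is registered
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b                      # Add dispatches to the FPGA kernel

config = tf.ConfigProto(log_device_placement=True)  # verify the placement
with tf.Session(config=config) as sess:
    print(sess.run(c))
```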
Consider the following two hardware scenarios:
How could the tensorflow core be accelerated for these scenarios?
FPGA vendors do offer OpenCL toolchains, so the OpenCL APIs can be used to parallelize computations on the FPGA fabric.
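For illustration, here is the shape of such an OpenCL flow from Python (via pyopencl); on a real FPGA the kernel would be precompiled offline by the vendor toolchain into a bitstream rather than built at runtime as below:

```python
import numpy as np
import pyopencl as cl  # picks up whichever OpenCL platform the vendor installs

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

ctx = cl.create_some_context()  # would select the FPGA platform if installed
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# On FPGAs this source would be compiled offline to a binary; building
# at runtime, as here, is the GPU/CPU convenience path.
prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()

prg.vadd(queue, a.shape, None, a_g, b_g, out_g)
res = np.empty_like(a)
cl.enqueue_copy(queue, res, out_g)
assert np.allclose(res, a + b)
```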
Forgive me, I am a bit ignorant and still learning in this area, but I am intrigued by the future possibility of this and would love to help in any way I can.