
Binary ops #1592

Closed · bhack opened this issue Mar 23, 2016 · 74 comments

@bhack (Contributor) commented Mar 23, 2016

Is there already a plan to add binary ops like bitcount for XNOR-NET?

@vrv (Contributor) commented Mar 23, 2016

I think this would be awesome to have, and contributions are definitely welcome here :).

@bhack (Author) commented Mar 23, 2016

@vrv We probably need an XnorGemm in Eigen first. /cc @benoitsteiner What do you think?

@vrv (Contributor) commented Mar 23, 2016

Or we could implement them as individual OpKernels if it is too difficult to get this into Eigen.
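
For anyone who wants to try that route, here is a hypothetical skeleton of a standalone kernel following the public custom-op registration pattern; the op name, signature, and shapes are illustrative, not an agreed design:

    // xnor_gemm_op.cc -- sketch only; binarization and packing omitted.
    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"

    using namespace tensorflow;

    REGISTER_OP("XnorGemm")
        .Input("a: float")
        .Input("b: float")
        .Output("product: float");

    class XnorGemmOp : public OpKernel {
     public:
      explicit XnorGemmOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

      void Compute(OpKernelContext* ctx) override {
        const Tensor& a = ctx->input(0);
        const Tensor& b = ctx->input(1);
        Tensor* out = nullptr;
        OP_REQUIRES_OK(ctx, ctx->allocate_output(
            0, TensorShape({a.dim_size(0), b.dim_size(1)}), &out));
        // Binarize a and b, pack the sign bits, then do XOR + popcount
        // here instead of the usual multiply-add.
      }
    };

    REGISTER_KERNEL_BUILDER(Name("XnorGemm").Device(DEVICE_CPU), XnorGemmOp);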

@bhack (Author) commented Mar 23, 2016

This is a reference XnorGemm (with a custom kernel) released under BSD, related to a previous paper.

Edit: the kernel is here.

@bhack (Author) commented Mar 23, 2016

@scott-gray is working on an improved version with the upstream author. Scott, will you release the code under BSD or Apache? The Eigen library is currently under BSD and TF under Apache, but TF also requires a CLA signature.

@bhack (Author) commented Mar 23, 2016

/cc @mrastegari if interested

@bhack (Author) commented Mar 29, 2016

@benoitsteiner Do you think these operations could be added in Eigen first?

@benoitsteiner (Contributor) commented Mar 29, 2016

@bhack We have a set of Eigen extensions to better support quantized operations on tensors in https://github.com/tensorflow/tensorflow/tree/master/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint. It's definitely possible to use the same approach to package an XnorGemm operation.
I can also talk to the other maintainers of Eigen to check whether it makes sense to add the code to core Eigen and make it more widely available.

@bhack (Author) commented Mar 29, 2016

@benoitsteiner Yes, it could be useful if you can collect some upstream opinions.

@bhack (Author) commented May 5, 2016

8-bit quantization is available now; see the merged #2230.

@kofd commented May 5, 2016

Has there been any progress on this? It would be useful for some embedded applications where an NVIDIA GPU isn't an option.


@petewarden (Member) commented May 9, 2016

I have been looking at 'popcount' (as bitcount is often known) for binary networks, since that seems to be the trickiest part to map to processor instructions. There is some BSD-licensed work here:
https://github.com/WojciechMula/sse-popcount
Interestingly, the x86 popcnt CPU instruction seems to be competitive with the SSE implementations. It looks like ARM requires a multi-instruction macro, though:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0081b/CHDJJGAJ.html

@bhack (Author) commented May 9, 2016

@petewarden There are also the GCC built-ins and the LLVM intrinsic. How many compilers does TF want to support?

@kofd commented May 9, 2016

@bhack: I was talking about the bit-count convolutions used in XNOR-Net.

@bhack (Author) commented May 9, 2016

@petewarden There is a simple test of the built-ins with GCC and MSVC at https://github.com/hanji/popcnt/blob/master/populationcount.cpp. I think it would be easy to also add the LLVM intrinsic for popcount.
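
For illustration, a minimal portable wrapper over those built-ins might look like this (a sketch only, not TF code; the bit-twiddling fallback covers compilers without an intrinsic):

    #include <cstdint>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #endif

    // 64-bit population count: compiler built-in where available,
    // SWAR bit-twiddling fallback otherwise.
    inline uint32_t popcount64(uint64_t x) {
    #if defined(__GNUC__) || defined(__clang__)
      return static_cast<uint32_t>(__builtin_popcountll(x));  // GCC/Clang
    #elif defined(_MSC_VER) && defined(_M_X64)
      return static_cast<uint32_t>(__popcnt64(x));            // MSVC x64
    #else
      // Classic parallel bit count (Hacker's Delight).
      x = x - ((x >> 1) & 0x5555555555555555ULL);
      x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
      x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
      return static_cast<uint32_t>((x * 0x0101010101010101ULL) >> 56);
    #endif
    }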

@zhengwy888 commented May 19, 2016

I have implemented a primitive op on CPU for the XOR + bitcount, but it's too slow right now. Does anyone know how to speed this up? If it ever gets to the same speed as tf.matmul, I will provide a patch. Note this is not for convolution; it simply replaces matmul() with XNOR + bit count.

    // Pack the sign bits of each column of `array` into 32-bit words:
    // out(c, r) holds bits r*32 .. r*32+31 of column c. Bits past
    // array.dimension(0) are left as 0 in the final word.
    void concatenate_col(typename MatMulTypes<T>::in_type array,
                         MaskMatrix& out) {
      const int rowSize = static_cast<int>((array.dimension(0) + 31) / 32);
      out.resize(array.dimension(1), rowSize);
      for (int c = 0; c < array.dimension(1); ++c) {
        for (int r = 0; r < rowSize; ++r) {
          uint32_t rvalue = 0;
          for (int i = 0; i < 32; ++i) {
            const int rowIdx = r * 32 + i;
            if (rowIdx > array.dimension(0) - 1) break;
            const uint32_t sign = (array(rowIdx, c) >= 0);  // one bit per sign
            rvalue |= (sign << i);
          }
          out(c, r) = rvalue;
        }
      }
    }

    // The same packing along the rows of `array`: out(r, c) holds bits
    // c*32 .. c*32+31 of row r.
    void concatenate_row(typename MatMulTypes<T>::in_type array,
                         MaskMatrix& out) {
      const int colSize = static_cast<int>((array.dimension(1) + 31) / 32);
      out.resize(array.dimension(0), colSize);
      for (int r = 0; r < array.dimension(0); ++r) {
        for (int c = 0; c < colSize; ++c) {
          uint32_t rvalue = 0;
          for (int i = 0; i < 32; ++i) {
            const int colIdx = c * 32 + i;
            if (colIdx > array.dimension(1) - 1) break;
            const uint32_t sign = (array(r, colIdx) >= 0);
            rvalue |= (sign << i);
          }
          out(r, c) = rvalue;
        }
      }
    }

    // Binary "matmul": pack both operands, then replace multiply-add with
    // XOR + popcount. The zero padding bits in the last word of each row
    // match on both sides (0 XOR 0 = 0), so they never reach the count.
    void concatenate_and_compute(const CPUDevice& d,
                                 typename MatMulTypes<T>::in_type a,
                                 typename MatMulTypes<T>::in_type b,
                                 typename MatMulTypes<T>::out_type out) {
      MaskMatrix a_;
      MaskMatrix b_;
      concatenate_row(a, a_);
      concatenate_col(b, b_);
      for (int ar = 0; ar < a_.rows(); ar++) {
        for (int br = 0; br < b_.rows(); br++) {
          unsigned int Cvalue = 0;  // number of mismatched sign bits
          for (int c = 0; c < a_.cols(); c++) {
            Cvalue += popcnt(a_(ar, c) ^ b_(br, c));
          }
          // Dot product of +/-1 vectors of length K = a.dimension(1):
          // matches - mismatches = K - 2 * mismatches.
          out(ar, br) = -(2.0f * Cvalue - a.dimension(1));
        }
      }
    }
@ppwwyyxx (Contributor) commented May 20, 2016

From my experience, the best approach for popcnt on AVX2 is this one: https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp, but that code needs a small fix for a counter overflow, and the XOR also needs to be done in AVX2.
For TF I guess there also needs to be a generic non-AVX2 path; there are some references in that repo as well.
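
For readers who don't want to open the link: the lookup kernel splits each byte into nibbles and uses _mm256_shuffle_epi8 as a 16-entry popcount table. Below is a simplified sketch that widens with _mm256_sad_epu8 on every iteration, which sidesteps the byte-counter overflow mentioned above at some cost in throughput:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Popcount over a buffer whose size is a multiple of 32 bytes (AVX2).
    uint64_t popcount_avx2(const uint8_t* data, size_t n) {
      // Nibble popcount table, replicated across both 128-bit lanes.
      const __m256i lut = _mm256_setr_epi8(
          0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
          0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
      const __m256i low_mask = _mm256_set1_epi8(0x0f);
      __m256i acc = _mm256_setzero_si256();
      for (size_t i = 0; i < n; i += 32) {
        const __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(data + i));
        const __m256i lo = _mm256_and_si256(v, low_mask);
        const __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        const __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                            _mm256_shuffle_epi8(lut, hi));
        // Horizontally sum groups of 8 byte-counters into 64-bit lanes on
        // each iteration, so the 8-bit counters can never overflow.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
      }
      uint64_t lanes[4];
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
      return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }

For the XNOR-GEMM case you would _mm256_xor_si256 the two packed operands before the table lookups.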

@bhack (Author) commented May 20, 2016

@ppwwyyxx Have you benchmarked it against the recent GCC, MSVC, and LLVM/Clang intrinsics?

@ppwwyyxx (Contributor) commented May 20, 2016

@bhack I don't think there will be a big difference between the intrinsics; they all end up as AVX2 instructions anyway.

@bhack (Author) commented May 20, 2016

Are you sure that every compiler's generated code uses AVX2? I think the built-ins are also supported on ARM/NEON.

@ppwwyyxx (Contributor) commented May 20, 2016

Oh, right. If you are talking about compatibility, then the compiler built-ins may be a good choice, but I have never tried them.

@girving added the triaged label Jun 8, 2016
@isabel-schwende commented Jul 6, 2016

Has there been any movement on this issue? I'm very interested in seeing how binary networks can be trained using TensorFlow. I have studied the work of Courbariaux and played a bit with his implementations (specifically BinaryConnect), but my final goal is to have XNOR-Net running on TensorFlow.

@rapatel0 commented Aug 11, 2016

We got a "version" working in Eigen::Tensor (~7x speedup over float on a Xeon with 256-bit AVX), but we're still hitting relatively low accuracies (20-40% more error than a float-based net). The accuracy drops quickly as you increase the number of output channels in a layer.

From a performance point of view, the bit-packing code slows things down a bit, and the calculation of the beta scaling factors still takes time. I'm not sure the original paper took this into account, because it defined the "convolution" as separate from the binarize step; in practice, however, you need both for every conv2d layer. Still, I might be missing something.

BTW, I'm having issues getting the code to compile in TensorFlow. It works fine in plain Eigen, though.
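
For concreteness, the weight half of that binarize step is cheap: the paper approximates W ≈ α·sign(W) with α = mean(|W|) per filter. A minimal plain-C++ sketch of that part (illustrative only, not the Eigen version discussed above):

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // XNOR-Net weight binarization for one flattened filter:
    // approximate W by alpha * sign(W), where alpha = mean(|W|).
    // Returns alpha; writes the +1/-1 signs into `signs`.
    float binarize_filter(const std::vector<float>& w,
                          std::vector<int8_t>& signs) {
      float alpha = 0.0f;
      signs.resize(w.size());
      for (std::size_t i = 0; i < w.size(); ++i) {
        alpha += std::fabs(w[i]);
        signs[i] = (w[i] >= 0.0f) ? 1 : -1;
      }
      return alpha / static_cast<float>(w.size());
    }

The per-window beta for the activations is the expensive part: as far as I recall, the paper computes it with an averaging convolution over |I|, an extra dense pass that the headline binary-GEMM numbers don't include.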

@zhengwy888 commented Aug 11, 2016

@rapatel0 Does 7x performance mean 7x faster? Could you release some part of your code? I tried to use the GCC popcnt built-in for bit counting, but it's really slow.

@rapatel0 commented Aug 12, 2016

Depending on your compiler flags, GCC's popcount built-in won't emit a hardware popcnt instruction, which limits performance; you should check that first. The code is still buggy and needs more testing, but I'll release it once we get a chance to clean it up.
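
A quick way to check what your flags produce is to inspect the generated assembly with -S; the exact lowering varies by GCC version and target, so treat the comments below as assumptions to verify:

    // popcnt_check.cpp
    //   g++ -O2 -S popcnt_check.cpp          -> may lower to a library call
    //   g++ -O2 -mpopcnt -S popcnt_check.cpp -> should emit popcnt directly
    #include <cstddef>
    #include <cstdint>

    uint64_t count_bits(const uint64_t* v, size_t n) {
      uint64_t total = 0;
      for (size_t i = 0; i < n; ++i) total += __builtin_popcountll(v[i]);
      return total;
    }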

@bhack (Author) commented Aug 14, 2016

What do you think of this Adobe patent?

@bhack (Author) commented Nov 12, 2016

For whoever is still interested in the topic, see QNN: https://arxiv.org/abs/1609.07061

@aselle added the type:feature label and removed the enhancement label Feb 9, 2017
@eamartin commented Feb 20, 2017

Any progress on this? Besides more sophisticated kernels like XnorGEMM, there's value in supporting bitwise and, or, and xor on types like int32 and int64.

@Randl (Contributor) commented Feb 25, 2017

Very low precision networks are definitely becoming more and more popular: see Trained Ternary Quantization and Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. We are looking forward to arbitrary quantization and efficient operations (e.g. XNOR-popcount) in TensorFlow.

@bhack (Author) commented Feb 25, 2017

Or in gemmlowp, so that other frameworks could use it too.

@Alexivia commented Mar 18, 2017

@lorenlugosch Did you manage to finish your binary convolution? I'm interested in knowing how to implement low-precision operations for the forward pass in CNNs... @bhack Do you know if there is any support for this in TF? Besides the 8-bit quantisation process, I want a "hard" quantisation, not the one that uses the max and min as float values.

@Cogitans commented May 20, 2017

Does anyone have an update on progress towards binary TF ops? I'm weighing the pros and cons of working on this problem myself (the pro being that it'll be useful, and the con primarily being the technical investment of fully grokking the Eigen/cuBLAS side of TF).

Thanks!

@AngusG commented Jul 28, 2017

It's still a bit rough, but here's a custom op with an MNIST training example and benchmarks against tf.matmul. Feedback and suggestions welcome!

https://github.com/AngusG/tensorflow-xnor-bnn

@annarev (Member) commented Feb 8, 2018

@ebrevdo, I found that you added the PopulationCount op some time ago:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/bitwise_ops.cc
which seems to support 'bitcount':
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/api_def/base_api/api_def_PopulationCount.pbtxt

Is this op available to use, or is more work needed?

@ebrevdo (Contributor) commented Feb 8, 2018

The op is supported for use but is not part of the public API; you can access it directly, though. See an example here.

@tensorflowbutler (Member) commented Feb 22, 2018

Nagging Assignee: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler (Member) commented Mar 9, 2018

Nagging Assignee @ebrevdo: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@ebrevdo (Contributor) commented Mar 10, 2018

Marking as closed since it's available.

@ebrevdo closed this Mar 10, 2018
@mohendra commented Nov 7, 2018

I can't find gemm_op.so.

@arashb commented Feb 17, 2020

> Is there already a plan to add binary ops like bitcount for XNOR-NET?

If someone is still interested, we have developed open-source libraries on top of @tensorflow and TensorFlow Lite to train and deploy binarized neural networks (BNNs) similar to XNOR-NET.
