Binary ops #1592

Closed
bhack opened this Issue Mar 23, 2016 · 72 comments

@bhack
Contributor

bhack commented Mar 23, 2016

Is there already a plan to add binary ops like bitcount for XNOR-NET?

@vrv

Contributor

vrv commented Mar 23, 2016

I think this would be awesome to have, and contributions are definitely welcome here :).

@bhack

Contributor

bhack commented Mar 23, 2016

@vrv Probably we need a XnorGemm first in Eigen. /cc @benoitsteiner What do you think?

@vrv

Contributor

vrv commented Mar 23, 2016

Or we could also implement them as individual OpKernels if it is too difficult to get it into Eigen.

@bhack

Contributor

bhack commented Mar 23, 2016

This is a reference XnorGemm (with a custom kernel) under BSD related to a previous paper.

Edit:
The kernel is here

@bhack

Contributor

bhack commented Mar 23, 2016

@scott-gray is working on an improved version with the upstream author. Scott will you release the code under BSD or Apache? Eigen library is currently on BSD and TF on Apache but needs cla signature.

@bhack

Contributor

bhack commented Mar 23, 2016

/cc @mrastegari if interested

@bhack

Contributor

bhack commented Mar 29, 2016

@benoitsteiner Do you think that these operations could be added to Eigen first?

@benoitsteiner

Contributor

benoitsteiner commented Mar 29, 2016

@bhack we have a set of Eigen extensions to better support quantized operations on tensors in https://github.com/tensorflow/tensorflow/tree/master/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint. It's definitely possible to use the same approach to package a XnorGemm operation.
I can also talk to the other maintainers of Eigen to check if it makes sense to add the code into the core Eigen and make it more widely available.

@bhack

Contributor

bhack commented Mar 29, 2016

@benoitsteiner Yes, it would be useful if you could collect some upstream opinions.

@bhack

Contributor

bhack commented May 5, 2016

8 bit quantization is available now. See merged #2230

@kofd

kofd commented May 5, 2016

Has there been any progress on this? It would be useful for some embedded applications where an nvidia gpu isn't an option

@petewarden

Member

petewarden commented May 9, 2016

I have been looking at 'popcount' (as bitcount is often known) for binary networks, since that seems to be the trickiest part to map to processor instructions. There is some BSD-licensed work here:
https://github.com/WojciechMula/sse-popcount
Interestingly, the dedicated x86 CPU instruction seems to be competitive with SSE implementations. It looks like ARM requires a multi-instruction macro though:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0081b/CHDJJGAJ.html

@bhack

Contributor

bhack commented May 9, 2016

@petewarden There are also the GCC built-in and the LLVM intrinsic. How many compilers does TF want to support?

@kofd

kofd commented May 9, 2016

@bhack: I was talking about the bit count convolutions used in xnor net.

@bhack

Contributor

bhack commented May 9, 2016

@petewarden A simple test of the built-ins with GCC and MSVC is at https://github.com/hanji/popcnt/blob/master/populationcount.cpp. I think it would be easy to also add the LLVM intrinsic for popcount.

@zhengwy888

zhengwy888 commented May 19, 2016

I have implemented a primitive op on CPU for the XOR + bitcount, but it's too slow right now. Does anyone know how to speed this up? If this ever gets to the same speed as tf.matmul then I could provide a patch. Note this is not for convolution; this is simply replacing matmul() with an XNOR + bit count.

  // Pack the sign bits of each column of `array` into 32-bit words.
  // Row c of `out` holds the packed column c of `array` (the matrix is
  // transposed during packing), so the inner loop below can walk both
  // packed operands row-wise.
  void concatenate_col(
          typename MatMulTypes<T>::in_type array,
          MaskMatrix &out)
  {
      int rowSize = int((array.dimension(0) + 31) / 32);
      out.resize(array.dimension(1), rowSize);

      for (int c = 0; c < array.dimension(1); ++c)
      {
          for (int r = 0; r < rowSize; ++r)
          {
              uint32_t rvalue = 0;
              uint32_t sign;
              for (int i = 0; i < 32; ++i) {
                  int rowIdx = r * 32 + i;
                  if (rowIdx > array.dimension(0) - 1) {
                      break;  // Leave the padding bits of the last word zero.
                  }
                  sign = (array(rowIdx, c) >= 0);  // One bit per sign: >= 0 maps to 1.
                  rvalue = rvalue | (sign << i);
              }
              out(c, r) = rvalue;
          }
      }
  }

  // Pack the sign bits of each row of `array` into 32-bit words.
  void concatenate_row(
          typename MatMulTypes<T>::in_type array,
          MaskMatrix &out)
  {
      int colSize = int((array.dimension(1) + 31) / 32);
      out.resize(array.dimension(0), colSize);
      for (int r = 0; r < array.dimension(0); ++r)
      {
          for (int c = 0; c < colSize; ++c)
          {
              uint32_t rvalue = 0;
              uint32_t sign;
              for (int i = 0; i < 32; ++i) {
                  int colIdx = c * 32 + i;
                  if (colIdx > array.dimension(1) - 1) {
                      break;  // Leave the padding bits of the last word zero.
                  }
                  sign = (array(r, colIdx) >= 0);
                  rvalue = rvalue | (sign << i);
              }
              out(r, c) = rvalue;
          }
      }
  }

  // Binarized matmul: pack both operands, then compute each output element as
  // K - 2 * popcount(a_row XOR b_col), which equals the dot product of the
  // corresponding +/-1 vectors (K = a.dimension(1), the inner dimension).
  void concatenate_and_compute(
          const CPUDevice &d,
          typename MatMulTypes<T>::in_type a,
          typename MatMulTypes<T>::in_type b,
          typename MatMulTypes<T>::out_type out)
  {
      MaskMatrix a_;
      MaskMatrix b_;

      concatenate_row(a, a_);
      concatenate_col(b, b_);

      for (int ar = 0; ar < a_.rows(); ar++)
      {
          for (int br = 0; br < b_.rows(); br++) {
              unsigned int Cvalue = 0;
              for (int c = 0; c < a_.cols(); c++)
              {
                  // XOR marks positions where the signs differ; popcount counts them.
                  unsigned int value = popcnt(a_(ar, c) ^ b_(br, c));
                  Cvalue += value;
              }
              out(ar, br) = -(2 * (float)Cvalue - a.dimension(1));
          }
      }
  }
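
The last line relies on the identity dot(a, b) = K - 2 * popcount(bits(a) XOR bits(b)) for +/-1 vectors of length K (here K = a.dimension(1)). A quick NumPy sketch to sanity-check that identity (illustrative only, not part of the op above):

    import numpy as np

    # For +/-1 vectors of length K: dot(a, b) == K - 2 * popcount(bits(a) ^ bits(b)).
    K = 64
    rng = np.random.RandomState(0)
    a = np.where(rng.randn(K) >= 0, 1.0, -1.0)
    b = np.where(rng.randn(K) >= 0, 1.0, -1.0)

    # Pack the sign bits (1 where the value is non-negative) into bytes.
    a_bits = np.packbits((a >= 0).astype(np.uint8))
    b_bits = np.packbits((b >= 0).astype(np.uint8))

    # popcount of the XOR counts the positions where the signs differ.
    diff = int(np.unpackbits(a_bits ^ b_bits).sum())
    assert a.dot(b) == K - 2 * diff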
@ppwwyyxx

Contributor

ppwwyyxx commented May 20, 2016

From my experience the best approach for popcnt on AVX2 is this one: https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp. But that code needs a small fix for a counter overflow, and the XOR also needs to be done in AVX2.
For TF I guess there needs to be a generic non-AVX2 path as well. There are some references in that repo too.

@bhack

Contributor

bhack commented May 20, 2016

@ppwwyyxx Have you benchmarked against recent GCC, MSVC, LLVM/CLANG intrinsics?

@ppwwyyxx

Contributor

ppwwyyxx commented May 20, 2016

@bhack I don't think there will be a big difference between the different intrinsics. They all end up as AVX2 instructions anyway.

@bhack

Contributor

bhack commented May 20, 2016

Are you sure that every compiler lowers these to AVX2 internally? I think the built-ins are also supported on ARM/NEON.

@ppwwyyxx

Contributor

ppwwyyxx commented May 20, 2016

Oh right, if you are talking about compatibility then compiler built-ins may be a good choice. But I never tried them.

@girving girving added the triaged label Jun 8, 2016

@isabel-schwende

isabel-schwende commented Jul 6, 2016

Has there been any movement on this issue? I'm very interested in seeing how binary networks can be trained using TensorFlow. I have studied the work of Courbariaux and played a bit with his implementations (specifically BinaryConnect), but my final goal would be to have XNOR-Net running in TensorFlow.

@lorenlugosch

lorenlugosch commented Jul 8, 2016

would also be interested in this!

@zhengwy888

zhengwy888 commented Jul 20, 2016

Binarization can be done with tf.sign(), though the tricky part is getting the gradient backprop to work after binarizing the input. For now this requires a separate op I implemented in TensorFlow: https://github.com/zhengwy888/binary_ops. With this code you can implement your own XNOR net on GPU. Comments/suggestions welcome.
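
A common pure-TF workaround for the backprop issue is a straight-through estimator built from tf.sign and tf.stop_gradient (a sketch only; this is not the custom op linked above):

    import tensorflow as tf

    def binarize(x):
        # Forward pass: sign(x). Backward pass: identity (straight-through estimator),
        # because the (sign(x) - x) term is excluded from the gradient.
        return x + tf.stop_gradient(tf.sign(x) - x)

    # Usage sketch (TF 1.x graph style): binarize the weights before a matmul;
    # gradients still flow into the underlying float variable.
    w = tf.Variable(tf.random_normal([784, 256]))
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.matmul(x, binarize(w))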

@bhack

Contributor

bhack commented Jul 20, 2016

For those interested, there is also https://arxiv.org/abs/1606.06160

@ppwwyyxx

Contributor

ppwwyyxx commented Jul 20, 2016

Thanks @bhack for mentioning it. We have a DoReFa-Net training implementation available at dorefa.net, which doesn't make use of any custom C++ op. Since DoReFa-Net is a generalization of XNOR-Net, XNOR-Net can be built in TF in a similar manner (without binary op acceleration). I'm also releasing a trainable DoReFa(1,2,6)-AlexNet later today.

@bhack

Contributor

bhack commented Jul 20, 2016

@ppwwyyxx Nice! /cc @wangyida

@isabel-schwende

isabel-schwende commented Jul 21, 2016

Thanks for sharing, everyone. But as I understand the DoReFa implementation so far, it still uses standard 32-bit tensors. I've been thinking about ways to use the official TensorFlow quantization methods to reduce the memory to at least a quarter. The 8-bit datatype is available. Of course the question is how to use it with the least hassle.

@wangyida

wangyida commented Jul 21, 2016

From my point of view, the released DoReFa pipeline is still float, and the quantization module in TF could give an 8-bit representation with little performance drop, but that isn't the aim of DoReFa, which is already a quantized model, just held in a float representation in the code.

@isabel-schwende

isabel-schwende commented Jul 21, 2016

@wangyida Yes, I agree that the intention of the DoReFa authors was mostly to reduce training and inference time by using low bitwidths. I don't expect them to release an 8-bit version. However, they also mention in their paper the idea of using this kind of network on embedded devices. If you want to use AlexNet on a very small device, I don't see the reason to use 32-bit floats if there is no information held in them anyway. To me, quantization to a smaller datatype is just the next logical step.

@ppwwyyxx

Contributor

ppwwyyxx commented Jul 21, 2016

@isabel-schwende Yes, the released implementation uses float32 to hold low-bitwidth numbers, because there are simply no low-bitwidth operations available in TF. And we never planned to build such operations into TF because we already have our own low-bitwidth run-time implementations working smoothly on ARM.
The released model is similar: it uses tf.float32 to hold all the binary weights as well as to run all the computation. But anyone who would like to implement those binary operations can make use of our pretrained model directly and gain a speedup.

@isabel-schwende

isabel-schwende commented Jul 21, 2016

@ppwwyyxx Thank you for your clarification, but I think I have to disagree at this point. Yes, there is no datatype in TensorFlow for 1, 2 or 6 bits, but there is the tf.quint8 datatype for tensors. @petewarden and his team introduced it in their tutorial here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/quantization/index.md. Sure, the method of quantization is different from what DoReFa does, as they keep minimums and maximums as floats. I've played around with the tutorial and also used a customised AlexNet saved in a protobuf file to quantize down to 8 bit, and I was able to observe that the protobuf file of the quantized network was indeed much smaller compared to the 32-bit float original. For now, the way low-bitwidth weights/activations are used is not compatible, so I was wondering if there is a way to, let's say, create a customised version of the existing quantization tool to also reduce the memory of the DoReFa AlexNet model for inference tasks on small devices. But I guess that would still be too much work at the moment.

@mrastegari

mrastegari commented Jul 29, 2016

@AaronYKing

AaronYKing commented Jul 29, 2016

@isabel-schwende Thank you very much for your patient reply. Now the author has released the Torch code.

@AaronYKing

AaronYKing commented Jul 29, 2016

@mrastegari Thank you for your contribution. I wonder if it can be implemented in Caffe, and whether anybody will do so in the near future?

@isabel-schwende

isabel-schwende commented Jul 29, 2016

@mrastegari Thanks a lot for sharing this information with us. I think this is going to be really helpful. Do you have any numbers on how much slower this Torch version is compared to the original Darknet version? @AaronYKing I'm always glad when I'm positively surprised by code being shared publicly.

@bhack
Contributor

bhack commented Jul 29, 2016

@ppwwyyxx

Contributor

ppwwyyxx commented Jul 29, 2016

@bhack This kernel works, but it's only about 3.4x faster than the best fp32 kernels (cuBLAS). It's mentioned in the BNN paper and in MatthieuCourbariaux/BinaryNet#1.

@ghost

ghost commented Jul 29, 2016

Just curious, is anyone seriously pursuing this? I've been working on a fast C++ implementation of XNORNets here at AI2 with @mrastegari and others for Intel and ARM CPUs. We've achieved a modicum of success, with much headroom still remaining untapped. We're kicking around the idea of producing a fast, reference CPU implementation, so it'd be good to know if someone else is already close to releasing it.

@csyking

csyking commented Jul 30, 2016

@dmitryb-ai2 @mrastegari It would be great progress if XNOR-Nets could be implemented in fast C++ for Intel and ARM CPUs, especially getting the 58× faster convolution operations and 32× memory savings mentioned in the paper. Sadly, I can't code it myself, so I'm looking forward to your good news!

@lorenlugosch

lorenlugosch commented Aug 1, 2016

I have an "implementation", but it achieves only ~15% accuracy on CIFAR-10 (compared to 86% using real-valued convolutions for the same architecture), so something is definitely wrong. I will post my code if and when I get it to work.


@csyking

csyking commented Aug 2, 2016

@lorenlugosch Hi, I am interested in your problem. What framework do you use? And will you release it on GitHub?

@lorenlugosch

lorenlugosch commented Aug 2, 2016

Hi @csyking, my binary convolution operation is a custom TensorFlow op written in C++. Everything else is normal TensorFlow operations. I will publish the code once I get the accuracy up.

@csyking

csyking commented Aug 7, 2016

Hi @lorenlugosch, it's very kind of you. Looking forward to your good news!

@maradonalavezzi

maradonalavezzi commented Aug 10, 2016

@mrastegari Could you share the model for ResNet-18? I am training it starting from the model you proposed in your implementation (with some modifications), but progress seems to be very slow.

@rapatel0

rapatel0 commented Aug 11, 2016

We got a "version" working in Eigen::Tensor (~7x performance over float on a Xeon with 256-bit AVX), but we're still hitting relatively low accuracies (20 to 40% more error than a float-based net). The accuracy drops quickly as you increase the output channels in a layer.

From a performance point of view, the bit-packing code slows things down a bit and the calculation of the beta values still takes time. I'm not sure if the original paper took this into account, because it defined a "convolution" as separate from the binarize step. In practice, however, you need both for every conv2d layer. Still, I might be missing something.

BTW, I'm having issues getting the code to compile in TensorFlow. It works fine in Eigen though.

@zhengwy888

zhengwy888 commented Aug 11, 2016

@rapatel0 Does 7x performance mean 7x faster? Could you release some part of your code? I tried to use the GCC popcount built-in for bit counting but it's really slow.

@rapatel0

rapatel0 commented Aug 12, 2016

Depending on your compiler flags, GCC's popcount built-in won't emit a _popcnt64 instruction, which limits the performance. You should check that first. The code is still buggy and requires some more testing, but I'll release it once we get a chance to clean it up.

@bhack

Contributor

bhack commented Aug 14, 2016

What do you think of this Adobe patent?

@bhack

Contributor

bhack commented Nov 12, 2016

For those still interested in the topic, see QNN: https://arxiv.org/abs/1609.07061

@aselle aselle added type:feature and removed enhancement labels Feb 9, 2017

@eamartin

eamartin commented Feb 20, 2017

Any progress on this? Besides more sophisticated kernels like XnorGEMM, there's value in supporting bitwise and, or, xor on types like int32 and int64.
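
Later TF releases do expose elementwise integer bitwise ops under tf.bitwise (a sketch, assuming a TF version that ships that module):

    import tensorflow as tf

    a = tf.constant([0b1100, 0b1010], dtype=tf.int32)
    b = tf.constant([0b1010, 0b0110], dtype=tf.int32)

    xor_ = tf.bitwise.bitwise_xor(a, b)  # [0b0110, 0b1100]
    and_ = tf.bitwise.bitwise_and(a, b)  # [0b1000, 0b0010]
    or_  = tf.bitwise.bitwise_or(a, b)   # [0b1110, 0b1110]

    with tf.Session() as sess:
        print(sess.run([xor_, and_, or_]))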

@Randl

Contributor

Randl commented Feb 25, 2017

Very low-precision networks are definitely becoming more and more popular: see Trained Ternary Quantization and Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. We are looking forward to arbitrary quantization and efficient operations (e.g. XNOR-popcount) in TensorFlow.

@bhack

Contributor

bhack commented Feb 25, 2017

Or in gemmlowp so other frameworks could use it.

@Alexivia

Alexivia commented Mar 18, 2017

@lorenlugosch Did you manage to finish your binary convolution? I'm interested in knowing how to implement low-precision operations for the forward pass in CNNs... @bhack Do you know if there is any support for this in TF? Besides the 8-bit quantisation process, I want a "hard" quantisation, not the one that uses the max and min with float values.

@Cogitans

Cogitans commented May 20, 2017

Does anyone have an update on progress towards binary TF ops? I'm weighing the pros and cons of working on this problem myself (the pro being that it'll be useful, and the con primarily being the technical investment of fully grokking the Eigen/cuBLAS side of TF).

Thanks!

@AngusG

AngusG commented Jul 28, 2017

It's still a bit rough, but here's a custom op with an MNIST training example and benchmarks against tf.matmul. Feedback and suggestions welcome!

https://github.com/AngusG/tensorflow-xnor-bnn

@annarev

Member

annarev commented Feb 8, 2018

ebrevdo@, I found that you added PopulationCount op some time ago:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/bitwise_ops.cc
which seems like it supports 'bitcount':
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/api_def/base_api/api_def_PopulationCount.pbtxt

Is this op available to use or is there more work needed?

@ebrevdo

Contributor

ebrevdo commented Feb 8, 2018

The op is supported for use but is not part of the public API; you can access it directly though. See an example here.
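
One way to reach it without the public API is through the generated bitwise ops module (a sketch; the internal module path is an assumption and may change between releases):

    import tensorflow as tf
    from tensorflow.python.ops import gen_bitwise_ops  # internal, not public API

    x = tf.constant([0b1011, 0b0001, 0xFF], dtype=tf.int32)
    bits = gen_bitwise_ops.population_count(x)  # per-element bit count

    with tf.Session() as sess:
        print(sess.run(bits))  # expected: [3, 1, 8]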

@tensorflowbutler

Member

tensorflowbutler commented Feb 22, 2018

Nagging Assignee: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler

Member

tensorflowbutler commented Mar 9, 2018

Nagging Assignee @ebrevdo: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@ebrevdo

Contributor

ebrevdo commented Mar 10, 2018

Marking as closed since it's available.

@ebrevdo ebrevdo closed this Mar 10, 2018
