GPU impl. of hue adjustment op #6818
Conversation
Can one of the admins verify this patch?
@@ -0,0 +1,64 @@
#include "tensorflow/core/framework/op_kernel.h"
You need the LICENSE text.
Will do that, thanks!
@@ -0,0 +1,142 @@
#if GOOGLE_CUDA
You need the LICENSE text.
Will fix that, thanks!
Thanks a lot for the contribution! I'll look into it.
BTW, the following test, run on a machine with 32 GB RAM, a 6-core Intel Core i7-5930K CPU @ 3.50GHz, and an NVIDIA GTX 1080 GPU, produced a GPU speedup factor of 17.3, using the fused kernel on both the CPU and GPU. The test used a 4D random image tensor with a batch size of 64 and an "image" size of 256x256x3. The GTX 1080 sustained 31,950 images per second, while the Core i7-5930K sustained 1,845 images per second. Longer runs, e.g. running tf.while_loop 100k times, can bring the speedup factor to about 30x on this reference hardware, especially for real images where pixel hues are spatially correlated (which reduces GPU warp divergence in the RGB-to-HSV part of the kernel). Both 3D and 4D tensors work, but 4D tensors of course perform better due to batching.

Note that for uint8 inputs (e.g. from tf.image_jpeg_decode), the results of the CPU and GPU kernels are identical, pixel for pixel. For int32 inputs with a numeric range of [0, 256], a small fraction of pixels differ by up to 0.3% in intensity (1/255), which is caused by the fact that the Python wrapper for adjust_hue() does scaling, and scaling a range of [0, 255] by 1E31 causes precision problems. That's a very small error for that unusual case. Numbers in the range [0, 255] represented as floats, or as uint8 (scaled by 255 for float conversion by convert_image_dtype() inside tf.image_adjust_hue()), return exactly the same results on the CPU and GPU.
virtual void DoCompute(OpKernelContext* context,
                       const ComputeOptions& options) = 0;

void Compute(OpKernelContext* context) override {
I think that is too much code in the header file. You can leave the declaration here, but move the definition back to the .cc file.
Will do, thanks!
typedef struct RgbTuple {
  float r, g, b;
Most TensorFlow code goes without the empty lines above and below, and uses:
float r;
float g;
float b;
Sounds good!
// hue
if (chroma > 0.0f) {
  if (M == r) {
Consistently remove empty lines at the beginning and end of each block.
In general, TensorFlow wants to fit more content within one page: add a blank line only between important functional blocks. There is no need for cosmetic blank lines.
Will do!
__global__ void adjust_hue_nhwc(const int number_elements,
    const float * const input, float * const output, const float * const hue_delta) {
  const float delta = hue_delta[0];
Set the indentation to two spaces consistently.
Thanks!
    output[idx + 2] = rgb.b;
  }
}
Leave a comment marking the end of the namespace.
const float * const input_data = input->flat<float>().data();
const float * const delta_h = delta->flat<float>().data();
float * const output_data = output->flat<float>().data();
const int threads_per_block = config.thread_per_block;
Wrong indentation.
const float * const delta_h = delta->flat<float>().data();
float * const output_data = output->flat<float>().data();
const int threads_per_block = config.thread_per_block;
const int block_count = (number_elements + threads_per_block - 1) / threads_per_block;
This or Eigen::divup. Really up to you.
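For reference, the ceil-division both options compute can be sketched as a small host-side helper (the name `DivUp` is hypothetical here; `Eigen::divup` does the same thing):

```cpp
// Ceil-division: the number of thread blocks needed so that
// block_count * threads_per_block >= number_elements.
// Valid for positive inputs; equivalent to Eigen::divup.
inline int DivUp(int number_elements, int threads_per_block) {
  return (number_elements + threads_per_block - 1) / threads_per_block;
}
```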
const HsvTuple hsv = rgb2hsv_cuda(input[idx], input[idx+1], input[idx+2]);

// hue adjustment
const float new_h = fmodf(hsv.h + delta, 1.0f);
Does fmodf handle negative delta?
@zheng-xq It does; and since the fmodf is modulo 1.0f, the delta can even be a multiple of 1 or -1. At a minimum, it supports the range [-1, 1].
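As a concrete check of fmodf's sign behavior: fmodf keeps the sign of its first argument, so a negative sum yields a negative remainder, and the conditional second fmodf that appears later in this thread shifts the result back into [0, 1). A host-side sketch, with a hypothetical helper name:

```cpp
#include <cmath>

// Wrap hue + delta into [0, 1). fmodf alone can return a negative value
// for a negative argument; the conditional second fmodf fixes that up.
inline float WrapHue(float h, float delta) {
  float new_h = fmodf(h + delta, 1.0f);
  if (new_h < 0.0f) {
    new_h = fmodf(1.0f + new_h, 1.0f);
  }
  return new_h;
}
```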
# TODO(zhengxq): we will switch to the fused version after we add a GPU
# kernel for that.
fused = os.environ.get('TF_ADJUST_HUE_FUSED', '')
For safety, leave the default as "fused". If the user specifies "0" or "false", use the old code path.
After a few weeks, if we don't see anything broken, we will remove the old path.
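A minimal sketch of that gate (the helper name is hypothetical; the real gate lives in the Python wrapper): fused by default, old path only when the user explicitly opts out.

```cpp
#include <cstdlib>
#include <cstring>

// Default to the fused path; fall back to the old path only if the
// environment variable is explicitly set to "0" or "false".
inline bool UseFusedAdjustHue() {
  const char* flag = std::getenv("TF_ADJUST_HUE_FUSED");
  if (flag == nullptr || flag[0] == '\0') return true;  // default: fused
  return std::strcmp(flag, "0") != 0 && std::strcmp(flag, "false") != 0;
}
```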
Sounds good!
TF_EXPECT_OK(InitOp());

AddInputFromArray<float>(TensorShape({1, 1, 1, 3}), {127, 137, 225});
There are a lot of Python tests under
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/image_ops_test.py
Does this test do something special?
Please make sure all the tests use the new GPU kernel, and pass after this change.
Ah OK, in that case, I'll just use the Python one, that should be enough.
@zheng-xq I made all the changes based on your requests. I think it's in a state that's acceptable based on your first review. The tests pass, as do my own experiments, for positive and negative delta. The code was refactored according to your spec. Please let me know if this is good enough to run the CI process.
Jenkins, test this please.
@drpngx @zheng-xq Everything that depends on the sanity checks step fails, because sanity checks don't pass. Looking at the log, I get the following message:
I never encountered this problem while building locally. I can make that change, but I'm wondering if that won't trigger other problems. Is there a way to run this sanity check step locally, without the entire CI infra, to see if that resolves the problem? Even in the absence of this change, I can build the _pywrap_tensorflow.so and pass //tensorflow/python:image_ops_test, etc. Please let me know how you would like me to proceed.
@drpngx @zheng-xq While running ./tensorflow/tools/ci_build/ci_sanity.sh locally, I get
So, I can't iterate locally to address this. Moreover, I seem to be getting another set of messages locally that the execution of the sanity check script doesn't seem to produce for you on Jenkins:
These don't seem to show up in your CI build. Please let me know what I can do to make it work.
You can do the lint check with our script. It's basically asking you to sort.
@drpngx Ah ok, great! Thank you! I'll check this ASAP. |
Thanks a lot for the new change! It looks much better now!
A few more comments enclosed.
tensorflow/core/kernels/BUILD
"//tensorflow/core:lib",
"//third_party/eigen3",
],
linkstatic = 1,
Is "linkstatic = 1" necessary here?
@zheng-xq It's not, I just checked, I'll remove it.
class AdjustHueOpBase : public OpKernel {
 protected:
The indentation within this class is lost. Please refer to the style guide as a reference.
https://google.github.io/styleguide/cppguide.html#Class_Format
@zheng-xq I'll fix that, thanks! I just discovered the Eclipse formatting style template to make sure the code follows Google's indentation rules now. I hope that's enough to ensure following style.
};

template <class Device>
class AdjustHueOp {};
I would prefer not to have a class definition at all, if possible.
@zheng-xq I think it will be possible to remove it from this file when the op part of the GPU impl gets moved to the same compilation unit as the CPU code, and the GPU file only has a functor, as you pointed out below.
// hue adjustment
float new_h = fmodf(hsv.h + delta, 1.0f);
if (new_h < 0.0f) {
  new_h = fmodf(1.0f + new_h, 1.0f);
I like this way much better. :)
Just to confirm, a conditional fmodf is actually faster than an unconditional fmodf? Thanks.
@zheng-xq I ran a small test (64x256x256x3 batch, repeated 1,000 times) and the conditional fmodf is actually faster by 2% for a positive hue delta. For a negative delta, it's 4% faster. These are values for a GTX 1080. I'm not sure about Kepler or Maxwell, but it definitely helps for Pascal.
template <>
class AdjustHueOp<GPUDevice> : public AdjustHueOpBase {
 public:
Same here. The indentation within the class is lost.
https://google.github.io/styleguide/cppguide.html#Class_Format
@zheng-xq I'll fix that, thanks!
}
};

REGISTER_KERNEL_BUILDER(Name("AdjustHue").Device(DEVICE_GPU),
Ah! I am a bit surprised that this worked at all.
Older nvcc cannot handle OpKernel and its registration process, so we had to do all the OpKernel work in the CPU translation unit, and only call out to a launch function in the CUDA translation unit.
It might be safer to follow that convention for the multiple CUDA compilers we are supporting. Here is an example; there are many in the kernels directory.
- GPU kernel registration in the CPU translation unit: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops.cc#L315
- The kernel does all the OpKernel-related things in this file, and calls out to its GPU functor at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops.cc#L291. Only simpler data structures cross the file boundary.
- The GPU functor is defined here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc#L59
- And instantiated here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc#L95
@zheng-xq It worked with nvcc from CUDA 8, but yes, I haven't checked with older compilers. I can try doing it the way you suggested if that helps generalize the build.
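A minimal plain-C++ sketch of that convention (all names hypothetical; the real definition would live in the .cu.cc file and launch the CUDA kernel): the functor is declared in a shared header, its definition lives in the GPU translation unit, and only raw pointers and sizes cross the boundary.

```cpp
#include <cstdint>

// --- shared header (e.g. adjust_hue_op.h), visible to .cc and .cu.cc ---
namespace functor {
struct AdjustHueGPU {
  // Only simple types cross the boundary: no Tensor, no OpKernelContext.
  void operator()(int64_t number_elements, const float* input,
                  const float* hue_delta, float* output);
};
}  // namespace functor

// --- definition: would live in adjust_hue_op_gpu.cu.cc and launch the
// kernel; a CPU stand-in is used here so the sketch is runnable ---
namespace functor {
void AdjustHueGPU::operator()(int64_t number_elements, const float* input,
                              const float* hue_delta, float* output) {
  // Real version: adjust_hue_nhwc<<<block_count, threads_per_block>>>(...);
  for (int64_t i = 0; i < number_elements; ++i) {
    output[i] = input[i];  // placeholder for the hue-adjusted value
  }
  (void)hue_delta;
}
}  // namespace functor
```

The OpKernel in the .cc file then just validates shapes and invokes `functor::AdjustHueGPU()` with the flattened pointers.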
if (idx >= number_elements) {
  return;
}
const HsvTuple hsv = rgb2hsv_cuda(input[idx], input[idx+1], input[idx+2]);
Out of curiosity, is it faster to have uncoalesced memory access, or staging through shared memory?
In addition, did you include the change to flip the default? It's okay to leave it as a different CL if you want. Also, we need to enable more tests on GPU by setting this to True: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/image_ops_test.py#L302 You can test the GPU version with --config=cuda, and the CPU version without --config=cuda.
@zheng-xq Yes, I changed the default back to the unfused kernels, so now the env var is needed again, like it used to be. I'll make the change in the test as well. Thanks!
@tensorflow-jenkins test this please
@mkolod we're swamped because of the dev summit tomorrow -- XQ will have more time afterwards :)
@vrv No worries! See you at the summit! :)
@zheng-xq Could you review it? Thanks! :)
if (M == r) {
  const float num = (g - b) / chroma;
  const float sgn = num < 0.0f;
  const float sign = powf(-1.0f, sgn);
Is this just implementing copysign()? Surely copysign() would be faster than invoking powf?
@jlebar nvcc doesn't offer copysign for device functions. There is signbit, but that only addresses line 48, not 49. One can of course instead do "sgn ? -1 : 1" on line 49, but conditionals cause warp divergence. Comments?
@jlebar Of course one can also implement an ugly device solution:
template <typename T> int sgn(T val) {
  return (T(0) < val) - (val < T(0));
}
or
2 * signbit(val) - 1
because signbit is implemented in the CUDA math lib, but copysign is not. So I'll do it that way since it's branchless.
FWIW, sgn ? -1 : 1 should be lowered using a PTX select, which does not cause divergence. But any of these options seems much better than powf.
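For concreteness, a host-side sketch of the signbit variant (note the orientation: std::signbit is true for negatives, so matching powf(-1.0f, num < 0.0f) needs 1 - 2 * signbit, not the reverse):

```cpp
#include <cmath>

// +1.0f for non-negative input, -1.0f for negative input, without a branch.
// Matches powf(-1.0f, static_cast<float>(num < 0.0f)) but is far cheaper.
inline float BranchlessSign(float num) {
  return 1.0f - 2.0f * static_cast<float>(std::signbit(num));
}
```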
(FTR, since I checked, clang has no problem with copysign, because we implement all of C++11 math.
$ echo '__global__ void f(float a, float* b) { *b = copysign(1, a); }' |\
clang++ -O2 -x cuda -c -S --cuda-device-only - -o -
)
Maybe not in clang, but I'm having issues with gcc (4.8.4) interacting with nvcc (CUDA 8.0.53).
Yes, I totally believe that nvcc doesn't handle this correctly. Sorry, didn't want to suggest that you needed to use copysign, just that we will be able to use copysign once we've rid ourselves of the nvcc dependency.
Actually, it turns out the reason was a conflict between copysign in gcc and nvcc. In nvcc, copysign delegates to the device function copysignf, and one of the imported headers must have transitively imported std::copysign from cmath (I assume dropping the std qualification via a using-declaration), so there was a conflict. Spelling out the name as copysignf makes things work just fine with gcc + nvcc in the mix. The same goes for using, say, fmodf rather than fmod: it avoids ambiguity even when cmath functions are pulled in by a using-declaration.
(FTR that's another problem that I believe we've fixed in clang. :)
In the absence of a benchmark, I don't care if we do copysign or a select on signbit, although having looked at the ptx, I expect that the signbit select will be faster. fmodf is already quite involved, so it might be lost in the noise.
#include "tensorflow/core/util/cuda_kernel_helper.h"

namespace tensorflow {
namespace internal {
Please place all of this code in an anonymous namespace so that it doesn't generate ODR violations if someone else declares e.g. "RgbTuple" in namespace tensorflow::internal.
You can add the anon ns inside ::tensorflow::internal and everything should keep working the same.
Will fix that, thanks!
You can place the functions in an anon ns too.
namespace tensorflow { namespace internal { namespace { ... }}}
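Sketched concretely (struct contents taken from the diff earlier in this thread; the helper function is hypothetical):

```cpp
namespace tensorflow {
namespace internal {
namespace {  // internal linkage: no ODR clash with other RgbTuple definitions

struct RgbTuple {
  float r;
  float g;
  float b;
};

RgbTuple MakeGray(float v) { return RgbTuple{v, v, v}; }

}  // namespace
}  // namespace internal
}  // namespace tensorflow
```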
RgbTuple tuple;
const float new_h = h * 6.0f;
const float chroma = v * s;
const float x = chroma * (1.0 - fabsf(fmodf(new_h, 2.0f) - 1.0f));
Should be 1.0f, otherwise we're forced to cast up to double precision for the subtraction and multiplication.
Given how easy it is to screw this up, I'm tempted to say we should just use integers everywhere and rely on the compiler to upcast to float.
Yes, good point.
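The promotion is easy to verify at compile time; a small sketch:

```cpp
#include <type_traits>

// A plain 1.0 literal is double, so mixed arithmetic promotes to double;
// a 1.0f literal keeps the whole expression in float.
static_assert(std::is_same<decltype(1.0 - 1.0f), double>::value,
              "double literal promotes the expression");
static_assert(std::is_same<decltype(1.0f - 1.0f), float>::value,
              "float literals keep it in float");
```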
);
}
}
}  // namespace functor
Two spaces before comments.
Please run clang-format over this patch; it looks like we're over 80 chars in some places.
Sounds good.
Thanks a lot for the changes! We are getting close now. Only a few minor comments.
template <>
class AdjustHueOp<CPUDevice> : public AdjustHueOpBase {
template <typename Device>
class AdjustHueOp : public OpKernel {
I personally like the AdjustHueOpBase approach better. Since it reduces template duplications, and leads to smaller code size. But if something makes it very difficult to use, I am fine with change.
@zheng-xq Somehow I thought that you didn't want the subclassing, in addition to suggesting the functor pattern to address the GPU compilation units. If it's OK to leave for now, I can fix that when I do the saturation PR later. Hopefully it's not too much template bloat relative to Eigen. :)
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/types.h"
Sorry that I didn't make this clear the first time. In general, we cannot make any major classes from "tensorflow/core/framework" visible to .cu.cc. That means no Tensor, no OpKernelContext there. When we pull the code inside, they might not work.
The common practice is to only pass either Eigen tensor, or just raw pointers between the .cc and .cu.cc files.
Oh OK, I wish I'd known that earlier! :)
Please ping me if and when you'd like me to look at this again.
@mkolod Ping, it looks like this is really close to merging. Also, there's a conflict to resolve now.
Thanks @dandelionmane, will make changes, time permitting.
@zheng-xq Any remaining concerns? Regarding formatting, I ended up using
Hm, didn't mean to approve globally. Definitely still need XQ's sign-off.
@zheng-xq friendly ping
FYI as a general rule, we shouldn't expect reviewers to review code with less latency than it takes reviewees to respond to comments. If nothing else, this encourages people to respond to comments quickly, which saves reviewer time, letting them keep reviews in cache.
LGTM. Thanks for the contribution!
This is related to issue #6817. Mentioning @zheng-xq since he implemented the fused CPU hue adjustment kernel.