
GPU impl. of hue adjustment op #6818

Merged
merged 1 commit into from Mar 14, 2017

Conversation

@mkolod (Contributor) commented Jan 13, 2017

This is related to issue #6817. Mentioning @zheng-xq since he implemented the fused CPU hue adjustment kernel.

@tensorflow-jenkins (Collaborator):

Can one of the admins verify this patch?

@@ -0,0 +1,64 @@
#include "tensorflow/core/framework/op_kernel.h"
Contributor:

You need the LICENSE text.

Contributor Author:

Will do that, thanks!

@@ -0,0 +1,142 @@
#if GOOGLE_CUDA
Contributor:

You need the LICENSE text.

Contributor Author:

Will fix that, thanks!

@drpngx added the "awaiting review" label Jan 13, 2017
@zheng-xq (Contributor):

Thanks a lot for the contribution! I'll look into it.

@mkolod (Contributor, author) commented Jan 13, 2017

BTW, the following test, run on a machine with 32 GB RAM, a 6-core Intel Core i7-5930K CPU @ 3.50GHz, and an NVIDIA GTX 1080 GPU, produced a GPU speedup factor of 17.3 with fused kernels on both CPU and GPU. The test used a 4D random image tensor with a batch size of 64 and an "image" size of 256x256x3. The GTX 1080 throughput was 31,950 images per second, while the Core i7-5930K throughput was 1,845 images per second. Longer runs, e.g. executing tf.while_loop 100k times, can bring the speedup factor to about 30x on this reference hardware, especially for real images, where pixel hues are spatially correlated (this reduces GPU warp divergence in the RGB-to-HSV part of the kernel). Both 3D and 4D tensors work, but 4D tensors naturally perform better due to batching. Note that for uint8 inputs (e.g. from tf.image.decode_jpeg), the results of the CPU and GPU kernels are identical, pixel for pixel. For int32 inputs with a numeric range of [0, 256], a small fraction of pixels differ by up to 0.3% in intensity (1/255): the Python wrapper for adjust_hue() rescales the input, and scaling a [0, 255] range down by a factor of about 2^31 causes precision problems. That's a very small error for an unusual case. Numbers in the range [0, 255] represented as floats, or as uint8 (scaled by 255 for float conversion by convert_image_dtype() inside tf.image.adjust_hue()), return exactly the same results on the CPU and GPU.

virtual void DoCompute(OpKernelContext* context,
const ComputeOptions& options) = 0;

void Compute(OpKernelContext* context) override {
Contributor:

I think that is too much code in the header file. You can leave the declaration here, but move the definition back to the .cc file.
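In other words, a minimal sketch of the suggested split (using the declarations from the diff above):

// In the header: declaration only.
class AdjustHueOpBase : public OpKernel {
 protected:
  virtual void DoCompute(OpKernelContext* context,
                         const ComputeOptions& options) = 0;
  void Compute(OpKernelContext* context) override;  // body moves to the .cc
};

// In the .cc file: the definition.
void AdjustHueOpBase::Compute(OpKernelContext* context) { /* ... */ }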

Contributor Author:

Will do, thanks!


typedef struct RgbTuple {

float r, g, b;
Contributor:

Most TensorFlow code goes without the empty lines above and below, and uses:

float r;
float g;
float b;

Contributor Author:

Sounds good!
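Applying both suggestions, the struct would presumably end up as:

typedef struct RgbTuple {
  float r;
  float g;
  float b;
} RgbTuple;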

// hue
if (chroma > 0.0f) {
if (M == r) {

Contributor:

Consistently remove empty lines at the beginning and end of each block.

In general, TensorFlow wants to fit more content within one page. Add new lines only between important functional blocks; there is no need for cosmetic new lines.

Contributor Author:

Will do!

__global__ void adjust_hue_nhwc(const int number_elements,
const float * const input, float * const output, const float * const hue_delta) {

const float delta = hue_delta[0];
Contributor:

Set the indentation to two spaces consistently.

Contributor Author:

Thanks!

output[idx + 2] = rgb.b;

}
}
Contributor:

Leave a comment marking the end of the namespace.
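For example, in the convention used elsewhere in TensorFlow:

}  // namespace internal
}  // namespace tensorflow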

const float * const input_data = input->flat<float>().data();
const float * const delta_h = delta->flat<float>().data();
float * const output_data = output->flat<float>().data();
const int threads_per_block = config.thread_per_block;
Contributor:

Wrong indentation.

const float * const delta_h = delta->flat<float>().data();
float * const output_data = output->flat<float>().data();
const int threads_per_block = config.thread_per_block;
const int block_count = (number_elements + threads_per_block - 1) / threads_per_block;
Contributor:

This or Eigen::divup. Really up to you.
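Both spellings compute the same ceiling division; a sketch (assuming the Eigen Tensor headers, which define Eigen::divup, are in scope as elsewhere in this patch):

const int block_count =
    (number_elements + threads_per_block - 1) / threads_per_block;
// Equivalent:
const int block_count_eigen = Eigen::divup(number_elements, threads_per_block);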

const HsvTuple hsv = rgb2hsv_cuda(input[idx], input[idx+1], input[idx+2]);

// hue adjustment
const float new_h = fmodf(hsv.h + delta, 1.0f);
Contributor:

Does fmodf handle negative delta?

Contributor Author:

@zheng-xq It does, and since the fmodf is modulo 1.0f, the delta can even be a multiple of 1 or -1. But at a minimum, it supports the range [-1, 1].
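For reference, fmodf keeps the sign of its first argument, which is why the revised kernel (quoted later in this review) wraps negative results back into [0, 1):

// fmodf(0.1f - 0.3f, 1.0f) == -0.2f, so a negative delta can leave the hue negative.
float new_h = fmodf(hsv.h + delta, 1.0f);
if (new_h < 0.0f) {
  new_h = fmodf(1.0f + new_h, 1.0f);  // wrap back into [0, 1)
}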


# TODO(zhengxq): we will switch to the fused version after we add a GPU
# kernel for that.
fused = os.environ.get('TF_ADJUST_HUE_FUSED', '')
Contributor:

As a safety precaution, leave the default as "fused". If the user specifies "0" or "false", use the old code path.

After a few weeks, if we don't see anything broken, we will remove the old path.

Contributor Author:

Sounds good!


TF_EXPECT_OK(InitOp());

AddInputFromArray<float>(TensorShape({1, 1, 1, 3}), {127, 137, 225});
Contributor:

There are a lot of Python tests under

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/image_ops_test.py

Does this test do something special?

Please make sure all the tests will use the new GPU kernel, and pass after this change.

Contributor Author:

Ah OK, in that case I'll just use the Python tests; that should be enough.

@mkolod (Contributor, author) commented Jan 20, 2017

@zheng-xq I made all the changes based on your requests. I think it's in a state that's acceptable based on your first review. The tests pass, as do my own experiments, for positive and negative delta. The code was refactored according to your spec. Please let me know if this is good enough to run the CI process.

@drpngx (Contributor) commented Jan 23, 2017

Jenkins, test this please.

@mkolod (Contributor, author) commented Jan 23, 2017

@drpngx @zheng-xq Everything that depends on the sanity checks step fails, because sanity checks don't pass. Looking at the log, I get the following message:

FAIL: buildifier found errors and/or warnings in above BUILD files.
buildifier suggested the following changes:

1530a1531
> linkstatic = 1,
1537d1537
< linkstatic = 1,
1542d1541
< prefix = "adjust_hue_op",
1543a1543
> prefix = "adjust_hue_op",

Please fix manually or run buildifier to auto-fix.

I never encountered this problem while building locally. I can make that change, but I'm wondering whether it will trigger other problems. Is there a way to run this sanity-check step locally, without the entire CI infra, to see if that resolves the problem? Even without this change, I can build _pywrap_tensorflow.so and pass //tensorflow/python:image_ops_test, etc.

Please let me know how you would like me to proceed.

@mkolod (Contributor, author) commented Jan 23, 2017

@drpngx @zheng-xq While running ./tensorflow/tools/ci_build/ci_sanity.sh locally, I get

FAIL: buildifier found errors and/or warnings in above BUILD files.
buildifier suggested the following changes:
./tensorflow/tools/ci_build/ci_sanity.sh: line 285: buildifier: command not found

So, I can't iterate locally to address this.

Moreover, I get another set of messages locally that the sanity-check script doesn't seem to produce for you on Jenkins:

FAIL: mismatch in packaged licenses and external dependencies
Please remove the licenses for the following external dependencies:
//third_party/hadoop
@boringssl//
@curl//
@jemalloc//

These don't seem to show up in your CI build.

Please let me know what I can do to make it work.

@drpngx (Contributor) commented Jan 23, 2017

You can do the link check with our script.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/ci_sanity.sh#L84

It's basically asking you to sort.

@mkolod (Contributor, author) commented Jan 23, 2017

@drpngx Ah ok, great! Thank you! I'll check this ASAP.

@zheng-xq (Contributor) left a review:

Thanks a lot for the new change! It looks much better now!

A few more comments enclosed.

"//tensorflow/core:lib",
"//third_party/eigen3",
],
linkstatic = 1,
Contributor:

Is "linkstatic = 1" necessary here?

Contributor Author:

@zheng-xq It's not; I just checked. I'll remove it.


class AdjustHueOpBase : public OpKernel {

protected:
Contributor:

The indentation within this class is lost. Please refer to the style guide as a reference.

https://google.github.io/styleguide/cppguide.html#Class_Format

Contributor Author (Feb 1, 2017):

@zheng-xq I'll fix that, thanks! I just discovered the Eclipse formatting template for Google style, so the code should follow Google's indentation rules now. I hope that's enough to ensure consistent style.

};

template <class Device>
class AdjustHueOp {};
Contributor:

I would prefer not to have a class definition at all, if possible.

Contributor Author:

@zheng-xq I think it will be possible to remove it from this file when the op part of the GPU impl gets moved to the same compilation unit as the CPU code, and the GPU file only has a functor, as you pointed out below.

// hue adjustment
float new_h = fmodf(hsv.h + delta, 1.0f);
if (new_h < 0.0f) {
new_h = fmodf(1.0f + new_h, 1.0f);
Contributor:

I like this way much better. :)

Just to confirm, a conditional fmodf is actually faster than an unconditional fmodf? Thanks.

Contributor Author (Feb 1, 2017):

@zheng-xq I ran a small test (64x256x256x3 batch, repeated 1,000 times) and the conditional fmodf is actually faster by 2% for a positive hue delta. For a negative delta, it's 4% faster. These are values for a GTX 1080. I'm not sure about Kepler or Maxwell, but it definitely helps for Pascal.


template <>
class AdjustHueOp<GPUDevice> : public AdjustHueOpBase {
public:
Contributor:

Same here. The indentation within the class is lost.

https://google.github.io/styleguide/cppguide.html#Class_Format

Contributor Author:

@zheng-xq I'll fix that, thanks!

}
};

REGISTER_KERNEL_BUILDER(Name("AdjustHue").Device(DEVICE_GPU),
Contributor:

Ah! I am a bit surprised that this worked at all.

Older nvcc cannot handle OpKernel and its registration process, so we had to do all the OpKernel work in the CPU translation unit and only call out to a launch function in the CUDA translation unit.

It might be safer to follow that convention for the multiple CUDA compilers we are supporting. Here is an example; there are many in the kernels directory.

  1. GPU kernel registration in the CPU translation unit:
     https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops.cc#L315
  2. The kernel does all the OpKernel-related work in this file and calls out to its GPU functor; only simpler data structures cross the file boundary:
     https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops.cc#L291
  3. The GPU functor is defined here:
     https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc#L59
  4. And instantiated here:
     https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc#L95
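A minimal sketch of that split, with hypothetical names modeled on this PR (not the exact signatures that landed):

// adjust_hue_op.cc (CPU translation unit): the OpKernel is defined and
// registered here, even for DEVICE_GPU, and calls a functor declared as:
namespace functor {
struct AdjustHueGPU {
  void operator()(int number_elements, const float* input,
                  const float* delta, float* output);
};
}  // namespace functor

// adjust_hue_op_gpu.cu.cc (CUDA translation unit): defines
// functor::AdjustHueGPU::operator() and launches the __global__ kernel from
// it. No tensorflow/core/framework classes appear in this file.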

Contributor Author:

@zheng-xq It worked with nvcc from CUDA 8, but yes, I haven't checked with older compilers. I can try doing it the way you suggested if that helps generalize the build.

if (idx > number_elements) {
return;
}
const HsvTuple hsv = rgb2hsv_cuda(input[idx], input[idx+1], input[idx+2]);
Contributor:

Out of curiosity, is it faster to have uncoalesced memory access, or staging through shared memory?

Contributor Author (Feb 2, 2017):

@zheng-xq It seems that shared memory didn't help that much, but the load issue gets addressed quite well by the texture cache by using __restrict__ on the input pointer (see here for more details).
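For illustration, a sketch of a kernel signature with the qualifier applied (close to, but not necessarily identical to, the one in the patch):

__global__ void adjust_hue_nhwc(const int number_elements,
                                const float* const __restrict__ input,
                                float* const __restrict__ output,
                                const float* const __restrict__ hue_delta) {
  // const + __restrict__ lets nvcc route the input loads through the
  // read-only (texture) cache path on Kepler-class and newer GPUs.
}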

@zheng-xq (Contributor):

In addition, did you include the change to flip the default? It's okay to leave it as a different CL if you want.

Also we need to enable more tests on GPU by setting this to True:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/image_ops_test.py#L302

You can test the GPU version with --config=cuda, and the CPU version without --config=cuda.

@drpngx added the "stat:awaiting response" label and removed the "awaiting review" label Jan 24, 2017
@mkolod (Contributor, author) commented Feb 1, 2017

@zheng-xq Yes, I changed the default back to the unfused kernels, so now the env var is needed again, like it used to be. I'll make the change in the test as well. Thanks!

@mkolod (Contributor, author) commented Feb 2, 2017

@zheng-xq @drpngx I'm running CI now and making a few last changes (specifically this one), so please don't review or merge yet. I'll make a comment here when the code is ready for final review, build and merge.

@mkolod (Contributor, author) commented Feb 8, 2017

@zheng-xq Could you review this? Also, seems like Linux tests have failed due to some networking problem (see here). Could you have a look? Thanks!

@vrv commented Feb 13, 2017

@tensorflow-jenkins test this please

@vrv added the "stat:awaiting tensorflower" label and removed the "stat:awaiting response" label Feb 13, 2017
@vrv commented Feb 15, 2017

@mkolod we're swamped because of the dev summit tomorrow -- XQ will have more time afterwards :)

@mkolod (Contributor, author) commented Feb 15, 2017

@vrv No worries! See you at the summit! :)

@mkolod (Contributor, author) commented Feb 17, 2017

@zheng-xq Could you review it? Thanks! :)

if (M == r) {
const float num = (g - b) / chroma;
const float sgn = num < 0.0f;
const float sign = powf(-1.0f, sgn);
Contributor:

Is this just implementing copysign()? Surely copysign() would be faster than invoking powf?

Contributor Author:

@jlebar nvcc doesn't offer copysign for device functions. There is signbit, but that only addresses line 48, not 49. One can of course instead do "sgn ? -1 : 1" on line 49, but conditionals cause warp divergence. Comments?

Contributor Author (Feb 21, 2017):

@jlebar Of course one can also implement an ugly device-side solution:

template <typename T>
int sgn(T val) {
  return (T(0) < val) - (val < T(0));
}

Contributor Author (Feb 21, 2017):

or 2 * signbit(val) - 1, because signbit is implemented in the CUDA math lib, but copysign is not. So I'll do it that way since it's branchless.

Contributor (@jlebar, Feb 21, 2017):

FWIW sgn ? -1 : 1 should be lowered using PTX select, which does not cause divergence. But any of these options seems much better than powf.

Contributor:

(FTR, since I checked, clang has no problem with copysign, because we implement all of C++11 math.

$ echo '__global__ void f(float a, float* b) { *b = copysign(1, a); }' |\
  clang++ -O2 -x cuda -c -S --cuda-device-only - -o -

)

Contributor Author:

Maybe not in clang, but I'm having issues with gcc (4.8.4) interacting with nvcc (CUDA 8.0.53).

Contributor:

Yes, I totally believe that nvcc doesn't handle this correctly. Sorry, didn't want to suggest that you needed to use copysign, just that we will be able to use copysign once we've rid ourselves of the nvcc dependency.

Contributor Author:

Actually, it turns out the reason was a conflict between copysign in gcc and nvcc. In nvcc, copysign delegates to the device function copysignf, and one of the imported headers must have transitively imported std::copysign from cmath (I assume pulling it out of std via a using-declaration), so there was a conflict. Spelling out the name as copysignf makes things work fine with gcc + nvcc in the mix. The same goes for using, say, fmodf rather than fmod: it avoids ambiguity even when cmath functions are "used".

Contributor:

(FTR that's another problem that I believe we've fixed in clang. :)

In the absence of a benchmark, I don't care if we do copysign or a select on signbit, although having looked at the ptx, I expect that the signbit select will be faster. fmodf is already quite involved, so it might be lost in the noise.
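Summing up the options discussed in this thread, a sketch (polarity matching the original powf(-1.0f, sgn): -1 for negative input, +1 otherwise):

__device__ float hue_sign(float num) {
  // nvcc lowers this ternary to a PTX selp instruction, so no divergence.
  return num < 0.0f ? -1.0f : 1.0f;
  // Equivalently: copysignf(1.0f, num), once the f-suffixed name is spelled
  // out to sidestep the gcc/nvcc std::copysign ambiguity.
}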

#include "tensorflow/core/util/cuda_kernel_helper.h"

namespace tensorflow {
namespace internal {
Contributor:

Please place all of this code in an anonymous namespace so that it doesn't generate ODR violations if someone else declares e.g. "RgbTuple" in namespace tensorflow::internal.

You can add the anon ns inside ::tensorflow::internal and everything should keep working the same.

Contributor Author:

Will fix that, thanks!

Contributor:

You can place the functions in an anon ns too.

namespace tensorflow { namespace internal { namespace { ... }}}
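Expanded, the suggested layout would look roughly like:

namespace tensorflow {
namespace internal {
namespace {  // anonymous: these names cannot collide with other TUs

struct RgbTuple {
  float r;
  float g;
  float b;
};

__global__ void adjust_hue_nhwc(/* ... */) { /* ... */ }

}  // namespace
}  // namespace internal
}  // namespace tensorflow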

RgbTuple tuple;
const float new_h = h * 6.0f;
const float chroma = v * s;
const float x = chroma * (1.0 - fabsf(fmodf(new_h, 2.0f) - 1.0f));
Contributor:

Should be 1.0f, otherwise we're forced to cast up to double precision for the subtraction and multiplication.

Given how easy it is to screw this up, I'm tempted to say we should just use integers everywhere and rely on the compiler to upcast to float.

Contributor Author:

Yes, good point.
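Concretely, the fix keeps the whole expression in single precision:

// Before: 1.0 is a double literal, so the subtraction and multiplication
// are performed in double precision and rounded back down.
const float x_slow = chroma * (1.0 - fabsf(fmodf(new_h, 2.0f) - 1.0f));
// After: all-float literals keep the arithmetic in single precision.
const float x = chroma * (1.0f - fabsf(fmodf(new_h, 2.0f) - 1.0f));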

);
}
}
} // namespace functor
Contributor:

Two spaces before comments.

Please run clang-format over this patch. It looks like we're over 80 chars in some places

Contributor Author:

Sounds good.

@zheng-xq (Contributor) left a review:

Thanks a lot for the changes! We are getting close now. Only a few minor comments.

template <>
class AdjustHueOp<CPUDevice> : public AdjustHueOpBase {
template<typename Device>
class AdjustHueOp : public OpKernel {
Contributor:

I personally like the AdjustHueOpBase approach better, since it reduces template duplication and leads to smaller code size. But if something makes it very difficult to use, I am fine with the change.

Contributor Author:

@zheng-xq Somehow I thought you didn't want the subclassing, in addition to suggesting the functor pattern to address the GPU compilation units. If it's OK to leave this for now, I can fix it when I do the saturation PR later. Hopefully it's not too much template bloat relative to Eigen. :)

#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/types.h"
Contributor:

Sorry that I didn't make this clear the first time. In general, we cannot make any major classes from "tensorflow/core/framework" visible to .cu.cc. That means no Tensor and no OpKernelContext there. When we pull the code inside, they might not work.

The common practice is to pass only Eigen tensors or raw pointers between the .cc and .cu.cc files.

Contributor Author:

Oh OK, I wish I'd known that earlier! :)

@jlebar (Contributor) commented Feb 23, 2017

Please ping me if and when you'd like me to look at this again.

@teamdandelion (Contributor):

@mkolod Ping, it looks like this is really close to merging. Also, there's a conflict to resolve now.

@teamdandelion added the "stat:awaiting response" label and removed the "stat:awaiting tensorflower" label Mar 7, 2017
@mkolod (Contributor, author) commented Mar 8, 2017

Thanks @dandelionmane, will make changes, time permitting.

@mkolod (Contributor, author) commented Mar 8, 2017

@zheng-xq Any remaining concerns? Regarding formatting, I ended up using clang-format --fallback-style="Google" < input.cc so I hope that is enough to resolve formatting. I think I addressed all the other concerns for the last review as well.

#include "tensorflow/core/util/cuda_kernel_helper.h"

namespace tensorflow {
namespace internal {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can place the functions in an anon ns too.

namespace tensorflow { namespace internal { namespace { ... }}}

@jlebar (Contributor) commented Mar 9, 2017

Hm, didn't mean to approve globally. Definitely still need XQ's sign-off.

@jhseu (Contributor) commented Mar 14, 2017

@zheng-xq friendly ping

@jlebar (Contributor) commented Mar 14, 2017

FYI as a general rule, we shouldn't expect reviewers to review code with less latency than it takes reviewees to respond to comments. If nothing else, this encourages people to respond to comments quickly, which saves reviewer time, letting them keep reviews in cache.

@zheng-xq (Contributor):

LGTM.

Thanks for the contribution!
