[Done] parallelize elementwise operation with openmp #2764
Conversation
Thanks a lot for the contribution. We'll review it within a week (it touches core parts). I see that you've reduced the
torch/lib/TH/generalFunc.h
Outdated
#ifndef GENERAL_FUNC_H
#define GENERAL_FUNC_H

ptrdiff_t SearchingIndex(ptrdiff_t index, long *stride, int dim, long* size)
torch/lib/TH/generic/THTensor.c
Outdated
{
  if(self->stride[d] == 0)
  {
    return 1;
torch/lib/TH/generic/THTensor.c
Outdated
@@ -640,6 +640,20 @@ int THTensor_(isTransposed)(const THTensor *self)
  return 0;
}

int THTensor_(hasZeroStride)(const THTensor *self)
{
  long z = 1;
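The excerpt above is cut off by the diff view. Below is a minimal sketch of what a zero-stride check of this kind could look like, inferred from the fragments quoted in this thread; it is an illustration, not necessarily the PR's exact implementation.

/* Hypothetical sketch: report whether any dimension of the tensor
 * has a zero stride (as produced e.g. by expanding a tensor). */
int THTensor_(hasZeroStride)(const THTensor *self)
{
  int d;
  for(d = 0; d < self->nDimension; d++)
  {
    if(self->stride[d] == 0)
      return 1;
  }
  return 0;
}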
torch/lib/TH/generic/THTensorMath.c
Outdated
#if defined(TH_REAL_IS_BYTE)
  TH_TENSOR_APPLY2_ADVANCED_INDEX(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = (((real) *t_data) >> value););
#else
  TH_TENSOR_APPLY2_ADVANCED_INDEX(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = (((unsigned real) *t_data) >> value););
Thanks a lot for the PR!
@soumith As for your concern that the
@fmassa You must have noticed the confusing behavior that occurs with expanded tensors, as you point out in Torch's "Buggy cmul behavior on Tensors with 0-strides". The parallelized macros use the tensor's number of elements to compute the memory offset, but the real memory footprint of an expanded tensor is smaller than what its sizes suggest, so that would cause out-of-bounds memory accesses.
@MlWoo while I agree that the result tensor having zero strides can lead to weird behaviour, input tensors should work just fine I think. Indeed, your index calculation multiplies the offset by the stride, so zero stride should not change the index? Also, a small nit: the name
@fmassa I initially avoided handling that situation so as not to deal with operations that involve copying from an expanded tensor. I had misunderstood that here. As for the name of the macro, how about TH_TENSOR_APPLY2_OMP? Thanks a lot for your advice.
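To make the point above concrete, here is a minimal illustration (my own sketch, not code from the PR) of why a zero stride is harmless for reads: the per-element offset is the counter multiplied by the stride, so a zero stride simply re-reads the same element, which is exactly the semantics of an expanded tensor. Only writing through a zero stride is problematic, because every iteration would then target the same element.

#include <stddef.h>

/* Hypothetical example: elementwise add where src has stride 0 along the
 * iterated dimension (an "expanded" input). */
void add_expanded(float *dst, const float *src, long src_stride, ptrdiff_t n)
{
  ptrdiff_t i;
  for (i = 0; i < n; i++) {
    /* With src_stride == 0, src[i * src_stride] is src[0] every time:
     * a valid read that never goes out of bounds. */
    dst[i] += src[i * src_stride];
  }
}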
Thanks for the PR.
I have only looked at the file structure and THTensorCopy.c for now. I will look at the macro and THTensorMath.c in detail later.
It would be interesting to see the benchmark you are using so that we can run it on different machines/architectures as well.
torch/lib/TH/generic/THTensorCopy.c
Outdated
int serial_path = 0;
int inOMP = omp_in_parallel();
if (tensorSize != tensorSize) {
torch/lib/TH/CMakeLists.txt
Outdated
@@ -384,6 +384,7 @@ INSTALL(TARGETS TH
  ARCHIVE DESTINATION "${TH_INSTALL_LIB_SUBDIR}")

INSTALL(FILES
  generalFunc.h
torch/lib/TH/generic/THTensorCopy.c
Outdated
int srcContig = THTensor_(isContiguous)(src);

int serial_path = 0;
int inOMP = omp_in_parallel();
torch/lib/TH/generic/THTensorCopy.c
Outdated
#ifdef _OPENMP
  int tensorZeroStride = THTensor_(hasZeroStride)(tensor);
  int srcZeroStride = THTensor_(hasZeroStride)(src);
  if (inOMP && (tensorZeroStride || srcZeroStride)) {
torch/lib/TH/generic/THTensorCopy.c
Outdated
    TH_TENSOR_APPLY2_ADVANCED_INDEX(srcSize, tensorContig, srcContig, real, tensor, real, src, *tensor_data = *src_data;)
  }
#else
  TH_TENSOR_APPLY2(real, tensor, real, src, *tensor_data = *src_data;)
@MlWoo I had another quick look at your PR (thanks once again!), I wonder if we shouldn't check for concurrent writes on the functions (which means the result tensor
@fmassa The concurrent writes all store the same value, so they only cause a performance drop, not a wrong result. We can probably remove the
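A minimal sketch of the situation being discussed (my own illustration with hypothetical names, not the PR's code): if the result tensor had a zero stride, every thread of the parallel loop would write to the same address. The argument in the thread is that such writes all store the same value, so the outcome is still correct and only performance suffers.

#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical example: fill through a possibly zero output stride.
 * With out_stride == 0, every iteration writes the same address, but
 * each write stores the same value. */
void fill_const(float *out, long out_stride, ptrdiff_t n, float value)
{
  ptrdiff_t i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (i = 0; i < n; i++)
    out[i * out_stride] = value;
}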
@MlWoo after a first quick look this looks good, but I won't have much time to look into it in detail before at least tomorrow.
@fmassa The PR touches the core and the change is large. Thanks a lot. I look forward to your suggestions.
@fmassa Could you spare some time to review the PR? I want to optimize the nn module based on this macro.
Hi @MlWoo,
Hi,
Once again thanks a lot for the PR!
I spent quite some time reviewing this PR, and my general impression is that this looks good.
But I found it very difficult to understand some parts of the code, and I think more comments in the code would help a lot.
I initially got confused about the relationship between line_index_offset/line_index_end and the real iteration order that happens in the flattened tensor (which is not in the range of line_index_offset and line_index_end).
Could you please add some comments in the code explaining the new macros that were added, and how the iteration order happens? That would help someone new to those functions understand the code.
Thanks!
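For readers of this thread, a simplified sketch of the index arithmetic involved (a paraphrase of the diff excerpt below, with hypothetical names): a flat element index such as line_index_offset is decomposed into per-dimension counters by repeated div/mod with the sizes, and each counter is multiplied by the corresponding stride to get the memory offset at which a thread starts iterating.

#include <stddef.h>

/* Hypothetical sketch: convert a flat element index into a memory offset
 * for a tensor with arbitrary strides (last dimension varies fastest). */
ptrdiff_t index_to_offset(ptrdiff_t index, const long *sizes,
                          const long *strides, int dim)
{
  ptrdiff_t offset = 0;
  ptrdiff_t quot = index;
  int i;
  for (i = dim - 1; i >= 0; i--) {
    long coord = quot % sizes[i];   /* counter along dimension i */
    quot /= sizes[i];
    offset += coord * strides[i];   /* advance by that dimension's stride */
  }
  return offset;
}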
aten/src/TH/THTensorApply.h
Outdated
ptrdiff_t TENSOR##_offset = 0; \
ptrdiff_t TENSOR##_quot = line_index_offset; \
for (TENSOR##_i = TENSOR##_dim-1; TENSOR##_i>=0; --TENSOR##_i) { \
  TENSOR##_counter_tmp[TENSOR##_i] = TENSOR##_quot%TENSOR##_sizes[TENSOR##_i]; \
aten/src/TH/THTensorApply.h
Outdated
  TENSOR##_offset += TENSOR##_counter_tmp[TENSOR##_i] * TENSOR##_strides[TENSOR##_i]; \
}

#define __TH_TENSOR_APPLYX_UPDATE_COUNTERS_OMP(TENSOR) \
aten/src/TH/THTensorApply.h
Outdated
  TENSOR2##_data += TENSOR2##_stride; \
  TENSOR1##_data += TENSOR1##_stride; \
} \
if (count < line_seg_len){ \
aten/src/TH/THTensorApply.h
Outdated
for(TENSOR##_i = TENSOR##_dim - 2; (TENSOR##_i >= 0) && (TENSOR##_carry_coord); TENSOR##_i--){ \
  TENSOR##_counter_tmp[TENSOR##_i]++; \
  TENSOR##_data += TENSOR##_strides[TENSOR##_i]; \
  if(TENSOR##_counter_tmp[TENSOR##_i] == TENSOR##_sizes[TENSOR##_i]){ \
@fmassa I have added some comments so that someone else can understand the code more easily. It is not easy for me to explain the complex process in English. Please review it and give me some suggestions. Thanks a lot.
@fmassa The threshold of reduction operations like
if (inOMP) {
  serial_path = 1;
} else {
  TH_TENSOR_APPLY2_OMP(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = *t_data * value;)
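The excerpt above shows the guard pattern in isolation. Here is a self-contained sketch of the same idea (a paraphrase with hypothetical names and a made-up threshold constant, not the PR's exact code): skip the OpenMP path when the code is already inside a parallel region, or when the tensor is too small to amortize the threading overhead, and fall back to the serial loop otherwise.

#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical constant standing in for TH's compile-time overhead threshold. */
#define OMP_OVERHEAD_THRESHOLD 100000

void scale(float *r, const float *t, ptrdiff_t n, float value)
{
  int serial_path = 1;
#ifdef _OPENMP
  if (!omp_in_parallel() && n >= OMP_OVERHEAD_THRESHOLD) {
    ptrdiff_t i;
    serial_path = 0;
#pragma omp parallel for
    for (i = 0; i < n; i++)
      r[i] = t[i] * value;
  }
#endif
  if (serial_path) {
    ptrdiff_t i;
    for (i = 0; i < n; i++)
      r[i] = t[i] * value;
  }
}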
real *r__data = rp+iter;
*r__data = 0;
for(j=0; j < t->size[dimension]; ++j) {
  *r__data += *(t_data + j*t->stride[dimension]);
@fmassa Oh, is it because this adds a whole new set of macros instead of modifying our old ones? I think we should drop the way we do these things at the moment, in favor of a more modern C++ approach. It's really hard to understand with
I think a good sanity check to do before merging this would be to decrease the OMP threshold to 0 and verify that all our tests still pass. They might be running on tensors not large enough to trigger these code paths.
@apaszke yes, it creates new macros. And I agree about the tests, and that's why I installed it locally and started doing some manual checks. +1 for a C++ implementation, but that can come later I think? @MlWoo I made sure that the sizes of the tensors were big enough so that the OMP path was activated.
Ok, I had a closer look into the reason why numpy was showing incredibly good performance using a single thread compared to our multi-threaded implementation, and I think I have found the reason. Indeed, contrary to pytorch, the result of numpy operations on non-contiguous arrays might be non-contiguous arrays.

a = torch.rand(300, 300, 1000).permute(1, 0, 2)
an = a.numpy()
# to check
print(a.stride())
# (1000L, 300000L, 1L)
print(an.strides)
# (4000, 1200000, 4)
# now let's perform some operations
print((a * 2).stride())
# gives (300000L, 1000L, 1L), a contiguous tensor
print((an * 2).strides)
# gives (4000, 1200000, 4), the same as an

For contiguous tensors, in my small tests the performance of pytorch was on par with numpy in the single-threaded case (as both seem to leverage SIMD instructions), including for operations like
Since the beginning, the contract in pytorch (and lua torch) was that (almost) all operations return a contiguous tensor, even if the inputs are non-contiguous. This doesn't seem to be the case in numpy, which leads to simple benchmarks leaning towards numpy being much faster. The question is, in real pipelines with lots of operations, is there value in keeping the original strides of the tensor à la numpy, in order to get better runtimes? Does it change the result in a significant manner?
@pytorchbot test this please
@MlWoo I want to get this merged. Can you rebase once on top of master?
@pytorchbot add to whitelist (also added via oss-ci)
@pytorchbot add to whitelist
@soumith Tests passed in different environments. All work is done. Thanks.
@MlWoo thank you so much for this PR!
Thanks a lot @MlWoo, and sorry for the delay in reviewing it!
Most elementwise operations on a discontiguous THTensor, such as copy, addition, multiplication and so on, are serial on the CPU backend, and the OpenMP overhead threshold is too high. This commit parallelizes elementwise operations on discontiguous THTensors with OpenMP.
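Putting the pieces from this thread together, a compact sketch of the overall approach (my own illustration with hypothetical names, much simpler than the incremental counter updates the actual macros perform): the flattened element range is split across threads, and each element's flat index is converted into a strided memory offset so that discontiguous tensors can be processed in parallel.

#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: flat index -> memory offset for a strided tensor. */
static ptrdiff_t offset_of(ptrdiff_t index, const long *sizes,
                           const long *strides, int dim)
{
  ptrdiff_t offset = 0, quot = index;
  int i;
  for (i = dim - 1; i >= 0; i--) {
    offset += (quot % sizes[i]) * strides[i];
    quot /= sizes[i];
  }
  return offset;
}

/* Hypothetical example: r = t * value over discontiguous tensors.
 * The real macros compute the starting offset once per thread and then
 * update per-dimension counters incrementally instead of calling
 * offset_of for every element. */
void mul_scalar_strided(float *r, const long *r_strides,
                        const float *t, const long *t_strides,
                        const long *sizes, int dim, ptrdiff_t n, float value)
{
  ptrdiff_t i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (i = 0; i < n; i++)
    r[offset_of(i, sizes, r_strides, dim)] =
        t[offset_of(i, sizes, t_strides, dim)] * value;
}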