[Done]parallelize elementwise operation with openmp #2764

Merged: 5 commits merged into pytorch:master on Jan 23, 2018

Conversation

@MlWoo (Contributor) commented Sep 18, 2017

Most elementwise operations on discontiguous THTensors, such as copy, addition, and multiplication, run serially on the CPU backend, and the OpenMP overhead threshold is too high. This commit parallelizes elementwise operations on discontiguous THTensors with OpenMP.
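As a rough standalone sketch of the idea (hypothetical function names and an illustrative 4096 cutoff; not the PR's actual TH macros), each OpenMP thread handles part of the flattened element range, and each linear index is turned into a storage offset using the tensor's sizes and strides; below the overhead threshold the loop stays serial.

/* Sketch only: offset_of() and mul_scalar_strided() are hypothetical names. */
#include <stddef.h>

static ptrdiff_t offset_of(ptrdiff_t linear, const long *sizes,
                           const long *strides, int ndim)
{
    ptrdiff_t off = 0;
    for (int d = ndim - 1; d >= 0; --d) {       /* peel indices from the innermost dim */
        off += (linear % sizes[d]) * strides[d];
        linear /= sizes[d];
    }
    return off;
}

void mul_scalar_strided(float *r, const float *t, const long *sizes,
                        const long *strides, int ndim, ptrdiff_t numel, float value)
{
    ptrdiff_t i;
    #pragma omp parallel for if (numel > 4096)  /* stay serial below the overhead threshold */
    for (i = 0; i < numel; ++i) {
        ptrdiff_t off = offset_of(i, sizes, strides, ndim);
        r[off] = t[off] * value;                /* assumes r and t share the same layout */
    }
}

The PR's macros are more refined than this per-element decomposition (they compute the offset once per chunk and then walk by strides), but this is the simplest form of the idea.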

@soumith requested a review from killeent on Sep 18, 2017
@soumith (Member) commented Sep 18, 2017

thanks a lot for the contribution. we'll review it within a week (it touches core parts).

I see that you've reduced the TH_OMP_OVERHEAD_THRESHOLD from 100k elements to 4k elements. Have you done some performance comparisons to make sure that this is okay? 4k elements seems small.

#ifndef GENERAL_FUNC_H
#define GENERAL_FUNC_H

ptrdiff_t SearchingIndex(ptrdiff_t index, long *stride, int dim, long* size)

{
  if(self->stride[d] == 0)
  {
    return 1;

@@ -640,6 +640,20 @@ int THTensor_(isTransposed)(const THTensor *self)
  return 0;
}

int THTensor_(hasZeroStride)(const THTensor *self)
{
  long z = 1;

#if defined(TH_REAL_IS_BYTE)
TH_TENSOR_APPLY2_ADVANCED_INDEX(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = (((real) *t_data) >> value););
#else
TH_TENSOR_APPLY2_ADVANCED_INDEX(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = (((unsigned real) *t_data) >> value););

@fmassa (Member) commented Sep 18, 2017

Thanks a lot for the PR!
I did a first quick pass and I have some small comments. I haven't checked the macros in detail yet (nor why you avoid handling expanded input tensors in parallel).
Also, as @soumith pointed out, do you have some performance comparisons between the previous behavior and the one proposed in this PR?

@MlWoo (Contributor, Author) commented Sep 19, 2017

@soumith As you point out, setting TH_OMP_OVERHEAD_THRESHOLD to 4k is debatable. 4k is an empirical value taken from one of our previous cases, and we strongly believe the right value depends on the specific CPU platform. We want to provide a benchmark to justify it, and we hope the value can eventually be chosen according to the CPU when the code is compiled. You may want to reproduce the benchmark results, but we focus on the performance of server CPUs such as Xeon and Xeon Phi, which have at least 22 cores, and I am afraid those CPUs may not be available to you. We also have desktop CPUs such as the i7-5960X. If you can tell us which CPU models you have access to, we will add them to our benchmark as far as possible so that you can reproduce the results conveniently.

@MlWoo (Contributor, Author) commented Sep 19, 2017

@fmassa You are probably aware of the confusing behavior of expanded tensors, as you pointed out in the Torch issue "Buggy cmul behavior on Tensors with 0-strides". The parallelization in the macros uses the tensor's element count to calculate memory offsets, but the actual memory backing an expanded tensor is smaller than its element count suggests, so this could cause out-of-bounds memory accesses.

@fmassa (Member) commented Sep 19, 2017

@MlWoo while I agree that the result tensor having zero strides can lead to weird behaviour, input tensors should work just fine I think.

Indeed, your index calculation multiplies the offset by the stride, so zero stride should not change the index?
And because we are talking about input tensors, there is only read access happening, not write.
But I might probably be missing something here :)
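For illustration of that point (hypothetical shapes and strides, not code from the PR): an input expanded from (1, 4) to sizes {3, 4} has strides {0, 1}, and because the per-dimension counters are multiplied by the strides, every computed offset stays within the four stored elements, so read-only parallel access cannot go out of bounds.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical example: sizes {3,4} with strides {0,1}, i.e. a (1,4) tensor
 * expanded along the first dimension. The zero-stride dimension contributes
 * nothing to the offset, so reads never leave the 4-element allocation. */
int main(void)
{
    long strides[2] = {0, 1};
    for (ptrdiff_t i = 0; i < 3 * 4; ++i) {
        ptrdiff_t off = (i / 4) * strides[0] + (i % 4) * strides[1];
        printf("element %td -> storage offset %td\n", i, off);  /* always 0..3 */
    }
    return 0;
}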

Also, a small nit: the name TH_TENSOR_APPLY2_ADVANCED_INDEX might lead to confusion, as advanced index is a specific operation in pytorch/numpy, so might be worth changing the name to something else (I don't have a better name now).

@MlWoo (Contributor, Author) commented Sep 19, 2017

@fmassa I avoided that situation at first so I wouldn't have to handle operations that copy from an expanded tensor; I had misunderstood the issue.
But you are right, and your comment made it click for me: the index calculation can also be used to compute indices into an expanded tensor. I will modify the code later.

As for the macro name, how about TH_TENSOR_APPLY2_OMP?

Thanks a lot for your advice.

@soumith (Member) commented Sep 19, 2017

TH_TENSOR_APPLY2_OMP sounds good.

@albanD (Collaborator) left a review comment

Thanks for the PR.

I have only looked at the file structure and THTensorCopy.c for now.
I will look in detail at the macros and THTensorMath.c later.

It would be interesting to see the benchmark you are using so that we can run it on different machines/architectures as well.


int serial_path = 0;
int inOMP = omp_in_parallel();
if (tensorSize != tensorSize) {

@@ -384,6 +384,7 @@ INSTALL(TARGETS TH
ARCHIVE DESTINATION "${TH_INSTALL_LIB_SUBDIR}")

INSTALL(FILES
generalFunc.h

int srcContig = THTensor_(isContiguous)(src);

int serial_path = 0;
int inOMP = omp_in_parallel();

#ifdef _OPENMP
int tensorZeroStride = THTensor_(hasZeroStride)(tensor);
int srcZeroStride = THTensor_(hasZeroStride)(src);
if (inOMP && (tensorZeroStride||srcZeroStride)) {

TH_TENSOR_APPLY2_ADVANCED_INDEX(srcSize, tensorContig, srcContig, real, tensor, real, src, *tensor_data = *src_data;)
}
#else
TH_TENSOR_APPLY2(real, tensor, real, src, *tensor_data = *src_data;)

@MlWoo (Contributor, Author) commented Sep 19, 2017

@soumith @albanD Could you provide the models of the CPUs that are available to you? That would make it convenient for you to reproduce the benchmark. We will find the same CPU models on our side to test the performance.

@fmassa (Member) commented Sep 20, 2017

@MlWoo I had another quick look at your PR (thanks once again!). I wonder if we shouldn't check for concurrent writes in these functions (which would mean the result tensor r_ maybe shouldn't have zero strides), but I'm not sure, and someone with more experience with OMP should definitely comment on this.
If what I said is not necessary, then we can probably remove the hasZeroStride function, as it's not used anywhere in the code anymore.

@MlWoo (Contributor, Author) commented Sep 21, 2017

@fmassa The concurrent writes all store the same value, so they only cause a performance drop, not an incorrect result. We can probably remove the hasZeroStride function.

@MlWoo (Contributor, Author) commented Sep 22, 2017

@soumith @fmassa @albanD We have released some data and a benchmark. Could you spare some time to review it? We will release more data next week. Thanks a lot.

@MlWoo changed the title from "parallelize elementwise operation with openmp" to "[WIP]parallelize elementwise operation with openmp" on Sep 25, 2017
@fmassa (Member) commented Sep 27, 2017

@MlWoo I had a quick look at the code for the benchmark and it looks good, thanks!
I have a quick question: for the comparison before your optimizations, did you use pytorch after adding this PR #2792 or was it before it?
Thanks!

@MlWoo (Contributor, Author) commented Sep 29, 2017

@fmassa The comparison against official pytorch was done before PR #2792. We will re-evaluate the performance with that PR included later.

@MlWoo (Contributor, Author) commented Dec 14, 2017

@fmassa A more efficient method to accelerate the basic operations is now implemented. Could you spare some time to review the code? The benchmark has also been updated. We will open a new pull request later to adjust the TH_OMP_OVERHEAD_THRESHOLD value. Thanks a lot.

@MlWoo changed the title from "[WIP]parallelize elementwise operation with openmp" to "[Done]parallelize elementwise operation with openmp" on Dec 14, 2017
@fmassa (Member) commented Dec 14, 2017

Hey, thanks a lot for the improvements. I'll try to have a look later today.
Also, @gchanan is working on implementing the same macros in ATen in #4161; I think it might be worth applying these patches there at some point.

@fmassa (Member) commented Dec 14, 2017

@MlWoo after a first quick look this looks good, but I won't have much time to look into it in detail before at least tomorrow.

@MlWoo (Contributor, Author) commented Dec 18, 2017

@fmassa The PR touches the core and the change is large. Thanks a lot; I look forward to your suggestions.

@MlWoo (Contributor, Author) commented Jan 9, 2018

@fmassa Could you spare some time to review the PR? I want to optimize the nn module based on the macro.

@fmassa (Member) commented Jan 12, 2018

Hi @MlWoo ,
Sorry for the delay. I'll get this reviewed this Monday!

@fmassa (Member) left a review comment

Hi,

Once again thanks a lot for the PR!

I spent quite some time reviewing this PR, and my general impression is that this looks good.

But I found it very difficult to understand some parts of the code, and I think comments in the code would help a lot.

I initially got confused about the relationship between line_index_offset/line_index_end and the actual iteration order over the flattened tensor (which does not simply follow the range from line_index_offset to line_index_end).

Could you please add some comments in the code explaining the new macros that were added, and how the iteration proceeds? That would help someone new to those functions understand the code.

Thanks!
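As context for the request above, here is a rough standalone sketch (hypothetical names and simplified logic, not the PR's actual code) of how such macros typically partition the work: each thread takes a contiguous range of flattened element indices, decomposes the range start (the line_index_offset analogue) into per-dimension counters plus a storage offset, and then walks forward by bumping counters and moving the data pointer by the matching strides.

#include <stddef.h>
#include <omp.h>

#define MAX_DIMS 16   /* sketch-only bound on dimensionality */

/* Doubles every element of a strided tensor; each thread owns the flattened
 * index range [begin, end). */
static void apply_double_strided(float *data, const long *sizes,
                                 const long *strides, int ndim, ptrdiff_t numel)
{
    if (numel == 0)
        return;
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();
        ptrdiff_t chunk = (numel + nthreads - 1) / nthreads;
        ptrdiff_t begin = tid * chunk;
        ptrdiff_t end = begin + chunk < numel ? begin + chunk : numel;

        /* decompose 'begin' into per-dimension counters and a storage offset */
        long counter[MAX_DIMS];
        ptrdiff_t quot = begin, offset = 0;
        for (int d = ndim - 1; d >= 0; --d) {
            counter[d] = quot % sizes[d];
            quot /= sizes[d];
            offset += counter[d] * strides[d];
        }

        float *p = data + offset;
        for (ptrdiff_t i = begin; i < end; ++i) {
            *p *= 2.0f;   /* the elementwise operation */
            /* advance: bump the innermost counter, carrying into outer
             * dimensions and adjusting the pointer by the matching strides */
            for (int d = ndim - 1; d >= 0; --d) {
                counter[d]++;
                p += strides[d];
                if (counter[d] < sizes[d]) break;
                counter[d] = 0;
                p -= sizes[d] * strides[d];
            }
        }
    }
}

The review excerpts below show the corresponding steps in the PR's actual macros (the offset decomposition and the counter update).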

ptrdiff_t TENSOR##_offset = 0; \
ptrdiff_t TENSOR##_quot = line_index_offset; \
for (TENSOR##_i = TENSOR##_dim-1; TENSOR##_i>=0; --TENSOR##_i) { \
TENSOR##_counter_tmp[TENSOR##_i] = TENSOR##_quot%TENSOR##_sizes[TENSOR##_i]; \

TENSOR##_offset += TENSOR##_counter_tmp[TENSOR##_i] * TENSOR##_strides[TENSOR##_i]; \
}

#define __TH_TENSOR_APPLYX_UPDATE_COUNTERS_OMP(TENSOR) \

TENSOR2##_data += TENSOR2##_stride; \
TENSOR1##_data += TENSOR1##_stride; \
} \
if (count < line_seg_len){ \

for(TENSOR##_i = TENSOR##_dim - 2; (TENSOR##_i >= 0) && (TENSOR##_carry_coord); TENSOR##_i--){ \
TENSOR##_counter_tmp[TENSOR##_i]++; \
TENSOR##_data += TENSOR##_strides[TENSOR##_i]; \
if(TENSOR##_counter_tmp[TENSOR##_i] == TENSOR##_sizes[TENSOR##_i]){ \

@MlWoo (Contributor, Author) commented Jan 16, 2018

@fmassa I have added some comments so that someone else can understand the code more easily. It is not easy for me to explain the complex process in English. Please review it and give me some suggestions. Thanks a lot.

@MlWoo (Contributor, Author) commented Jan 19, 2018

@fmassa The threshold for reduction operations like sum is fixed at 5000 for the moment; they won't use multiple threads if the tensor size is not greater than 5000. I am not sure whether that has an effect here or not.
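(For illustration only, a minimal sketch of that kind of gating; the 5000 cutoff is the value mentioned above, everything else is hypothetical.)

#include <stddef.h>

/* A contiguous sum that only goes multi-threaded past a fixed element count. */
double sum_contig(const double *data, ptrdiff_t n)
{
    double s = 0.0;
    ptrdiff_t i;
    #pragma omp parallel for reduction(+:s) if (n > 5000)
    for (i = 0; i < n; ++i)
        s += data[i];
    return s;
}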

if (inOMP) {
  serial_path = 1;
} else {
  TH_TENSOR_APPLY2_OMP(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = *t_data * value;)

real *r__data = rp+iter;
*r__data = 0;
for(j=0; j < t->size[dimension]; ++j) {
  *r__data += *(t_data + j*t->stride[dimension]);

@apaszke (Contributor) commented Jan 19, 2018

@fmassa Oh is it because this adds a whole new set of macros instead of modifying our old ones? I think we should drop the way we do these things at the moment, in favor of a more modern C++ approach. It's really hard to understand with TENSOR_##SOMETHING all around the place.

I think a good sanity check to do before merging this would be to decrease the OMP threshold to 0 and verify that all our tests still pass. They might be running on tensors not large enough to trigger these code paths.

@fmassa (Member) commented Jan 19, 2018

@apaszke yes, it creates new macros. And I agree about the tests, and that's why I installed it locally and started doing some manual checks. +1 for a C++ implementation, but that can come later, I think?

@MlWoo I made sure that the size of the tensor were big enough so that the OMP path was activated.

@fmassa (Member) commented Jan 22, 2018

Ok, I had a closer look into why numpy was showing incredibly good performance using a single thread compared to our multi-threaded implementation, and I think I have found the reason.

Indeed, contrary to pytorch, numpy operations on non-contiguous arrays may return non-contiguous arrays.
For example

a = torch.rand(300, 300, 1000).permute(1, 0, 2)
an = a.numpy()

# to check
print(a.stride())
# (1000L, 300000L, 1L)
print(an.strides)
# (4000, 1200000, 4)

# now let's perform some operations
print((a * 2).stride())
# gives (300000L, 1000L, 1L), a contiguous tensor
print((an * 2).strides)
# gives (4000, 1200000, 4), the same as an

For contiguous tensors, in my small tests the performance of pytorch was on par with numpy in the single threaded case (as both seem to leverage SIMD instructions), including for operations like log and exp.

Since the beginning, the contract in pytorch (and lua torch) has been that (almost) all operations return a contiguous tensor, even if the inputs are non-contiguous. Numpy does not follow this, which leads to simple benchmarks leaning towards numpy being much faster.

The question is, in real pipelines with lots of operations, is there a value in keeping the original strides of the tensor à la numpy, in order to get better runtimes? Does it change the result in a significant manner?

@soumith (Member) commented Jan 23, 2018

@pytorchbot test this please

@soumith (Member) commented Jan 23, 2018

@MlWoo I want to get this merged. Can you rebase once on top of master?

@soumith (Member) commented Jan 23, 2018

@pytorchbot add to whitelist

(also added via oss-ci)

@soumith (Member) commented Jan 23, 2018

@pytorchbot add to whitelist

@MlWoo (Contributor, Author) commented Jan 23, 2018

@soumith The tests passed in different environments. All the work is done. Thanks.

@soumith merged commit c2afd59 into pytorch:master on Jan 23, 2018
@soumith (Member) commented Jan 23, 2018

@MlWoo thank you so much for this PR!

@fmassa (Member) commented Jan 23, 2018

Thanks a lot @MlWoo , and sorry for the delay in reviewing it!
