[Done]parallelize elementwise operation with openmp #2764

Merged: 5 commits merged into pytorch:master on Jan 23, 2018

Conversation

@MlWoo (Contributor) commented Sep 18, 2017

Most elementwise operations on discontiguous THTensors, such as copy, addition, and multiplication, run serially on the CPU backend, and the OpenMP overhead threshold is too high. This commit parallelizes elementwise operations on discontiguous THTensors with OpenMP.
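As a rough standalone sketch of the idea (hypothetical function names and an illustrative 4096 cutoff; not the PR's actual TH macros), each OpenMP thread handles part of the flattened element range, and each linear index is turned into a storage offset using the tensor's sizes and strides; below the overhead threshold the loop stays serial.

/* Sketch only: offset_of() and mul_scalar_strided() are hypothetical names. */
#include <stddef.h>

static ptrdiff_t offset_of(ptrdiff_t linear, const long *sizes,
                           const long *strides, int ndim)
{
    ptrdiff_t off = 0;
    for (int d = ndim - 1; d >= 0; --d) {       /* peel indices from the innermost dim */
        off += (linear % sizes[d]) * strides[d];
        linear /= sizes[d];
    }
    return off;
}

void mul_scalar_strided(float *r, const float *t, const long *sizes,
                        const long *strides, int ndim, ptrdiff_t numel, float value)
{
    ptrdiff_t i;
    #pragma omp parallel for if (numel > 4096)  /* stay serial below the overhead threshold */
    for (i = 0; i < numel; ++i) {
        ptrdiff_t off = offset_of(i, sizes, strides, ndim);
        r[off] = t[off] * value;                /* assumes r and t share the same layout */
    }
}

The PR's macros are more refined than this per-element decomposition (they compute the offset once per chunk and then walk by strides), but this is the simplest form of the idea.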

@soumith requested a review from killeent on Sep 18, 2017
@soumith (Member) commented Sep 18, 2017

thanks a lot for the contribution. we'll review it within a week (it touches core parts).

I see that you've reduced the TH_OMP_OVERHEAD_THRESHOLD from 100k elements to 4k elements. Have you done some performance comparisons to make sure that this is okay? 4k elements seems small.

#ifndef GENERAL_FUNC_H
#define GENERAL_FUNC_H

ptrdiff_t SearchingIndex(ptrdiff_t index, long *stride, int dim, long* size)

{
  if(self->stride[d] == 0)
  {
    return 1;

@@ -640,6 +640,20 @@ int THTensor_(isTransposed)(const THTensor *self)
  return 0;
}

int THTensor_(hasZeroStride)(const THTensor *self)
{
  long z = 1;

#if defined(TH_REAL_IS_BYTE)
TH_TENSOR_APPLY2_ADVANCED_INDEX(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = (((real) *t_data) >> value););
#else
TH_TENSOR_APPLY2_ADVANCED_INDEX(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = (((unsigned real) *t_data) >> value););

@fmassa (Member) commented Sep 18, 2017

Thanks a lot for the PR!
I did a first quick pass and I have some small comments. I haven't checked the macros in detail yet (nor why you avoid handling expanded input tensors in parallel).
Also, as @soumith pointed out, do you have some performance comparisons between the previous behavior and the one proposed in this PR?

@MlWoo (Contributor, Author) commented Sep 19, 2017

@soumith As you point out, setting TH_OMP_OVERHEAD_THRESHOLD to 4k is debatable. 4k is an empirical value taken from one of our previous cases, and we strongly believe the right value depends on the specific CPU platform. We want to provide a benchmark to justify it, and we hope the value can eventually be chosen according to the CPU when the code is compiled. You may want to reproduce the benchmark results, but we focus on the performance of server CPUs such as Xeon and Xeon Phi, which have at least 22 cores, and I am afraid those CPUs may not be available to you. We also have desktop CPUs such as the i7-5960X. If you can tell us which CPU models you have access to, we will add them to our benchmark as far as possible so that you can reproduce the results conveniently.

@MlWoo (Contributor, Author) commented Sep 19, 2017

@fmassa You are probably aware of the confusing behavior of expanded tensors, as you pointed out in the Torch issue "Buggy cmul behavior on Tensors with 0-strides". The parallelization in the macros uses the tensor's element count to calculate memory offsets, but the actual memory backing an expanded tensor is smaller than its element count suggests, so this could cause out-of-bounds memory accesses.

@fmassa (Member) commented Sep 19, 2017

@MlWoo while I agree that the result tensor having zero strides can lead to weird behaviour, input tensors should work just fine I think.

Indeed, your index calculation multiplies the offset by the stride, so zero stride should not change the index?
And because we are talking about input tensors, there is only read access happening, not write.
But I might probably be missing something here :)
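For illustration of that point (hypothetical shapes and strides, not code from the PR): an input expanded from (1, 4) to sizes {3, 4} has strides {0, 1}, and because the per-dimension counters are multiplied by the strides, every computed offset stays within the four stored elements, so read-only parallel access cannot go out of bounds.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical example: sizes {3,4} with strides {0,1}, i.e. a (1,4) tensor
 * expanded along the first dimension. The zero-stride dimension contributes
 * nothing to the offset, so reads never leave the 4-element allocation. */
int main(void)
{
    long strides[2] = {0, 1};
    for (ptrdiff_t i = 0; i < 3 * 4; ++i) {
        ptrdiff_t off = (i / 4) * strides[0] + (i % 4) * strides[1];
        printf("element %td -> storage offset %td\n", i, off);  /* always 0..3 */
    }
    return 0;
}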

Also, a small nit: the name TH_TENSOR_APPLY2_ADVANCED_INDEX might lead to confusion, as advanced index is a specific operation in pytorch/numpy, so might be worth changing the name to something else (I don't have a better name now).

@MlWoo (Contributor, Author) commented Sep 19, 2017

@fmassa I avoided that situation at first so I wouldn't have to handle operations that copy from an expanded tensor; I had misunderstood the issue.
But you are right, and your comment made it click for me: the index calculation can also be used to compute indices into an expanded tensor. I will modify the code later.

As for the macro name, how about TH_TENSOR_APPLY2_OMP?

Thanks a lot for your advice.

@soumith (Member) commented Sep 19, 2017

TH_TENSOR_APPLY2_OMP sounds good.

@albanD (Collaborator) left a review comment

Thanks for the PR.

I have only looked at the file structure and THTensorCopy.c for now.
I will look in detail at the macros and THTensorMath.c later.

It would be interesting to see the benchmark you are using so that we can run it on different machines/architectures as well.


int serial_path = 0;
int inOMP = omp_in_parallel();
if (tensorSize != tensorSize) {

@@ -384,6 +384,7 @@ INSTALL(TARGETS TH
ARCHIVE DESTINATION "${TH_INSTALL_LIB_SUBDIR}")

INSTALL(FILES
generalFunc.h

int srcContig = THTensor_(isContiguous)(src);

int serial_path = 0;
int inOMP = omp_in_parallel();

#ifdef _OPENMP
int tensorZeroStride = THTensor_(hasZeroStride)(tensor);
int srcZeroStride = THTensor_(hasZeroStride)(src);
if (inOMP && (tensorZeroStride||srcZeroStride)) {

TH_TENSOR_APPLY2_ADVANCED_INDEX(srcSize, tensorContig, srcContig, real, tensor, real, src, *tensor_data = *src_data;)
}
#else
TH_TENSOR_APPLY2(real, tensor, real, src, *tensor_data = *src_data;)

@MlWoo (Contributor, Author) commented Sep 19, 2017

@soumith @albanD Could you provide the models of the CPUs that are available to you? That would make it convenient for you to reproduce the benchmark. We will find the same CPU models on our side to test the performance.

@fmassa (Member) commented Sep 20, 2017

@MlWoo I had another quick look at your PR (thanks once again!). I wonder if we shouldn't check for concurrent writes in these functions (which would mean the result tensor r_ maybe shouldn't have zero strides), but I'm not sure, and someone with more experience with OMP should definitely comment on this.
If what I said is not necessary, then we can probably remove the hasZeroStride function, as it's not used anywhere in the code anymore.

@MlWoo (Contributor, Author) commented Sep 21, 2017

@fmassa The concurrent writes all store the same value, so they only cause a performance drop, not an incorrect result. We can probably remove the hasZeroStride function.

@MlWoo (Contributor, Author) commented Sep 22, 2017

@soumith @fmassa @albanD We have released some data and a benchmark. Could you spare some time to review it? We will release more data next week. Thanks a lot.

@MlWoo changed the title from "parallelize elementwise operation with openmp" to "[WIP]parallelize elementwise operation with openmp" on Sep 25, 2017
@fmassa (Member) commented Sep 27, 2017

@MlWoo I had a quick look at the code for the benchmark and it looks good, thanks!
I have a quick question: for the comparison before your optimizations, did you use pytorch after adding this PR #2792 or was it before it?
Thanks!

@MlWoo (Contributor, Author) commented Sep 29, 2017

@fmassa The comparison against official pytorch was done before PR #2792. We will re-evaluate the performance with that PR included later.

@MlWoo (Contributor, Author) commented Dec 14, 2017

@fmassa A more efficient method to accelerate the basic operations is now implemented. Could you spare some time to review the code? The benchmark has also been updated. We will open a new pull request later to adjust the TH_OMP_OVERHEAD_THRESHOLD value. Thanks a lot.

@MlWoo changed the title from "[WIP]parallelize elementwise operation with openmp" to "[Done]parallelize elementwise operation with openmp" on Dec 14, 2017
@fmassa (Member) commented Dec 14, 2017

Hey, thanks a lot for the improvements. I'll try to have a look later today.
Also, @gchanan is working on implementing the same macros in ATen in #4161; I think it might be worth applying these patches there at some point.

@fmassa (Member) commented Dec 14, 2017

@MlWoo after a first quick look this looks good, but I won't have much time to look into it in detail before at least tomorrow.

@MlWoo (Contributor, Author) commented Dec 18, 2017

@fmassa The PR touches the core and the change is large. Thanks a lot; I look forward to your suggestions.

@MlWoo (Contributor, Author) commented Jan 9, 2018

@fmassa Could you spare some time to review the PR? I want to optimize the nn module based on the macro.

@fmassa (Member) commented Jan 12, 2018

Hi @MlWoo ,
Sorry for the delay. I'll get this reviewed this Monday!

@fmassa (Member) left a review comment

Hi,

Once again thanks a lot for the PR!

I spent quite some time reviewing this PR, and my general impression is that this looks good.

But I found it very difficult to understand some parts of the code, and I think comments in the code would help a lot.

I initially got confused about the relationship between line_index_offset/line_index_end and the actual iteration order over the flattened tensor (which does not simply follow the range from line_index_offset to line_index_end).

Could you please add some comments in the code explaining the new macros that were added, and how the iteration proceeds? That would help someone new to those functions understand the code.

Thanks!
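As context for the request above, here is a rough standalone sketch (hypothetical names and simplified logic, not the PR's actual code) of how such macros typically partition the work: each thread takes a contiguous range of flattened element indices, decomposes the range start (the line_index_offset analogue) into per-dimension counters plus a storage offset, and then walks forward by bumping counters and moving the data pointer by the matching strides.

#include <stddef.h>
#include <omp.h>

#define MAX_DIMS 16   /* sketch-only bound on dimensionality */

/* Doubles every element of a strided tensor; each thread owns the flattened
 * index range [begin, end). */
static void apply_double_strided(float *data, const long *sizes,
                                 const long *strides, int ndim, ptrdiff_t numel)
{
    if (numel == 0)
        return;
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();
        ptrdiff_t chunk = (numel + nthreads - 1) / nthreads;
        ptrdiff_t begin = tid * chunk;
        ptrdiff_t end = begin + chunk < numel ? begin + chunk : numel;

        /* decompose 'begin' into per-dimension counters and a storage offset */
        long counter[MAX_DIMS];
        ptrdiff_t quot = begin, offset = 0;
        for (int d = ndim - 1; d >= 0; --d) {
            counter[d] = quot % sizes[d];
            quot /= sizes[d];
            offset += counter[d] * strides[d];
        }

        float *p = data + offset;
        for (ptrdiff_t i = begin; i < end; ++i) {
            *p *= 2.0f;   /* the elementwise operation */
            /* advance: bump the innermost counter, carrying into outer
             * dimensions and adjusting the pointer by the matching strides */
            for (int d = ndim - 1; d >= 0; --d) {
                counter[d]++;
                p += strides[d];
                if (counter[d] < sizes[d]) break;
                counter[d] = 0;
                p -= sizes[d] * strides[d];
            }
        }
    }
}

The review excerpts below show the corresponding steps in the PR's actual macros (the offset decomposition and the counter update).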

ptrdiff_t TENSOR##_offset = 0; \
ptrdiff_t TENSOR##_quot = line_index_offset; \
for (TENSOR##_i = TENSOR##_dim-1; TENSOR##_i>=0; --TENSOR##_i) { \
TENSOR##_counter_tmp[TENSOR##_i] = TENSOR##_quot%TENSOR##_sizes[TENSOR##_i]; \

TENSOR##_offset += TENSOR##_counter_tmp[TENSOR##_i] * TENSOR##_strides[TENSOR##_i]; \
}

#define __TH_TENSOR_APPLYX_UPDATE_COUNTERS_OMP(TENSOR) \

TENSOR2##_data += TENSOR2##_stride; \
TENSOR1##_data += TENSOR1##_stride; \
} \
if (count < line_seg_len){ \

for(TENSOR##_i = TENSOR##_dim - 2; (TENSOR##_i >= 0) && (TENSOR##_carry_coord); TENSOR##_i--){ \
TENSOR##_counter_tmp[TENSOR##_i]++; \
TENSOR##_data += TENSOR##_strides[TENSOR##_i]; \
if(TENSOR##_counter_tmp[TENSOR##_i] == TENSOR##_sizes[TENSOR##_i]){ \

@MlWoo (Contributor, Author) commented Jan 16, 2018

@fmassa I have added some comments so that someone else can understand the code more easily. It is not easy for me to explain the complex process in English. Please review it and give me some suggestions. Thanks a lot.

@MlWoo (Contributor, Author) commented Jan 19, 2018

@fmassa The threshold for reduction operations like sum is fixed at 5000 for the moment; they won't use multiple threads if the tensor size is not greater than 5000. I am not sure whether that has an effect here or not.
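(For illustration only, a minimal sketch of that kind of gating; the 5000 cutoff is the value mentioned above, everything else is hypothetical.)

#include <stddef.h>

/* A contiguous sum that only goes multi-threaded past a fixed element count. */
double sum_contig(const double *data, ptrdiff_t n)
{
    double s = 0.0;
    ptrdiff_t i;
    #pragma omp parallel for reduction(+:s) if (n > 5000)
    for (i = 0; i < n; ++i)
        s += data[i];
    return s;
}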

if (inOMP) {
  serial_path = 1;
} else {
  TH_TENSOR_APPLY2_OMP(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = *t_data * value;)

real *r__data = rp+iter;
*r__data = 0;
for(j=0; j < t->size[dimension]; ++j) {
  *r__data += *(t_data + j*t->stride[dimension]);

@apaszke (Contributor) commented Jan 19, 2018

@fmassa Oh is it because this adds a whole new set of macros instead of modifying our old ones? I think we should drop the way we do these things at the moment, in favor of a more modern C++ approach. It's really hard to understand with TENSOR_##SOMETHING all around the place.

I think a good sanity check to do before merging this would be to decrease the OMP threshold to 0 and verify that all our tests still pass. They might be running on tensors not large enough to trigger these code paths.

@fmassa (Member) commented Jan 19, 2018

@apaszke yes, it creates new macros. And I agree about the tests, and that's why I installed it locally and started doing some manual checks. +1 for a C++ implementation, but that can come later, I think?

@MlWoo I made sure that the size of the tensor were big enough so that the OMP path was activated.

@fmassa (Member) commented Jan 22, 2018

Ok, I had a closer look into why numpy was showing incredibly good performance using a single thread compared to our multi-threaded implementation, and I think I have found the reason.

Indeed, contrary to pytorch, numpy operations on non-contiguous arrays may return non-contiguous arrays.
For example

a = torch.rand(300, 300, 1000).permute(1, 0, 2)
an = a.numpy()

# to check
print(a.stride())
# (1000L, 300000L, 1L)
print(an.strides)
# (4000, 1200000, 4)

# now let's perform some operations
print((a * 2).stride())
# gives (300000L, 1000L, 1L), a contiguous tensor
print((an * 2).strides)
# gives (4000, 1200000, 4), the same as an

For contiguous tensors, in my small tests the performance of pytorch was on par with numpy in the single threaded case (as both seem to leverage SIMD instructions), including for operations like log and exp.

Since the beginning, the contract in pytorch (and lua torch) has been that (almost) all operations return a contiguous tensor, even if the inputs are non-contiguous. Numpy does not follow this, which leads to simple benchmarks leaning towards numpy being much faster.

The question is, in real pipelines with lots of operations, is there a value in keeping the original strides of the tensor à la numpy, in order to get better runtimes? Does it change the result in a significant manner?

@soumith (Member) commented Jan 23, 2018

@pytorchbot test this please

@soumith (Member) commented Jan 23, 2018

@MlWoo I want to get this merged. Can you rebase once on top of master?

@soumith (Member) commented Jan 23, 2018

@pytorchbot add to whitelist

(also added via oss-ci)

@soumith (Member) commented Jan 23, 2018

@pytorchbot add to whitelist

@MlWoo (Contributor, Author) commented Jan 23, 2018

@soumith The tests passed in different environments. All the work is done. Thanks.

@soumith merged commit c2afd59 into pytorch:master on Jan 23, 2018
@soumith (Member) commented Jan 23, 2018

@MlWoo thank you so much for this PR!

@fmassa (Member) commented Jan 23, 2018

Thanks a lot @MlWoo , and sorry for the delay in reviewing it!
