
should uses of outer-scope names from parallel code be restricted? #11

Open
nelsc opened this issue Dec 8, 2015 · 10 comments

@nelsc
Contributor

nelsc commented Dec 8, 2015

No description provided.

@nelsc
Contributor Author

nelsc commented Dec 8, 2015

If the purpose of this is to do something about races, then what about races on objects accessed indirectly?

@phalpern

The idea here is to reduce the chance of creating races. For example:

int x = 0;
_Task_parallel for (int i = 0; i < N; ++i)
    x += f(i);  // Error: x is const within the parallel for loop

The use of x above would be a race. We should be using a _Reduction for x. If modification is really needed, an explicit address can be used (indirect accesses are not prevented from racing, as Clark observed):

int x = 0;
_Atomic int y = 0;
float A[N];
_Task_parallel for _Capture(py=&y, pA=&A[0])
    (int i = 0; i < N; ++i)
{
    *py += f(i);   // OK, *py is a modifiable atomic
    pA[i] += g(i); // OK, pA[i] is modifiable and unique
}

One disadvantage of the above is that expressions like A[i] = expr are very common in parallel code, where A[i] is known not to race with other loop iterations. We might make a special case for arrays -- effectively treating them as decaying to pointers.

Another disadvantage is that no parallel language that I'm aware of has this restriction. I don't know of any usage experience. It certainly deviates from both Cilk and OpenMP.
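
For comparison, this accumulation is commonly written today in OpenMP C with a reduction clause; the following is a minimal illustrative sketch (not CPLEX syntax), assuming the same hypothetical f as above:

int f(int);   /* hypothetical: the same f as in the example above */

int sum_f(int N)
{
    int x = 0;
    /* each thread accumulates into a private copy of x; the private copies
       are combined with + at the end of the loop, so there is no race on x */
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < N; ++i)
        x += f(i);
    return x;
}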

@BlaineGarst

Stack-based arrays are rarely large enough to merit parallel computation; most of the time one is dealing with allocated storage of some kind or, sure, static (global) storage.

I suggest that we leave static (global) variables as they are, and const capture automatics. They will rarely be arrays of significant size. Arrays of size 4 are interesting for graphics as point multipliers etc., but those are small enough to be const captured. (Please note that since they would be CONST they could be shared among the worker threads. This happens with Apple's Blocks.)
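
To illustrate the const-capture case with a small array (a sketch in the proposed syntax used elsewhere in this thread; scale, in, out, and N are hypothetical names):

const float scale[4] = { 1.0f, 1.0f, 1.0f, 0.5f };  /* small point-multiplier array */
_Task_parallel for (int i = 0; i < N; ++i)
    out[i] = in[i] * scale[i % 4];  /* read-only use: a const-captured copy of scale
                                       can safely be shared among the worker threads */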

@jeffhammond
Member

What if my parallel computation involves reading from a thousand-element automatic (i.e. stack-based) array and a million-element heap array while writing to a thousand-element automatic array? Is that going to be supported with this restriction?

Below is an excerpt from a widely used HPC application (NWChem; the associated file is ccsd_t_kernels_omp.F) in which I allocate a four-dimensional automatic array whose dimensions are usually around 30 (i.e. about 6.48 MB). You can see that my current Fortran implementation computes on this array inside of OpenMP regions.

      subroutine sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,h7d,
     2                     triplesx,t2sub,v2sub)
      IMPLICIT NONE
      integer h3d,h2d,h1d,p6d,p5d,p4d,h7d
      integer h3,h2,h1,p6,p5,p4,h7
      double precision triplesx(h3d,h2d,h1d,p6d,p5d,p4d)
      double precision t2sub(h7d,p4d,p5d,h1d)
      double precision v2sub(h3d,h2d,p6d,h7d)
      double precision v2tmp(h7d,h3d,h2d,p6d)
!$omp  parallel do collapse(3)
!$omp& default(shared) schedule(static)
!$omp& private(h1,h2,h3,p4,p5,p6,h7)
      do p6=1,p6d
      do h7=1,h7d
      do h2=1,h2d
!dec$ vector always nontemporal
!DIR$ IVDEP
      do h3=1,h3d
        v2tmp(h7,h3,h2,p6) = v2sub(h3,h2,p6,h7)
      enddo
      enddo
      enddo
      enddo
!$omp  parallel do collapse(OMP_COLLAPSE_LEVEL)
!$omp& default(shared) schedule(static)
!$omp& private(h1,h2,h3,p4,p5,p6,h7)
      do p5=1,p5d
      do p6=1,p6d
      do p4=1,p4d
      do h1=1,h1d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
      do h2=1,h2d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
      do h3=1,h3d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
!dir$ simd
!DIR$ IVDEP
      do h7=1,h7d
       triplesx(h3,h2,h1,p6,p5,p4)=triplesx(h3,h2,h1,p6,p5,p4)
     1  -t2sub(h7,p4,p5,h1)*v2tmp(h7,h3,h2,p6)
      enddo
      enddo
      enddo
      enddo
      enddo
      enddo
      enddo
!$omp end parallel do
      return
      end

This is but one type of computation where automatic arrays are large enough to merit parallel computation.

And yes, I really do want to be able to write the same code in pure C some day. I have already started converting these functions to C in ccsd_t_kernels_f2c.c. I'd like to not have to use OpenMP if possible.
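
To give a sense of what that conversion might look like, here is a minimal C sketch of the first loop nest (the v2tmp permutation) using C99 variable-length arrays and the existing OpenMP directive; the function name is hypothetical, v2tmp stays a large automatic array as in the Fortran, and a CPLEX version would replace the pragma:

void sd_t_d1_1_permute(int h3d, int h2d, int p6d, int h7d,
                       const double v2sub[h7d][p6d][h2d][h3d])
{
    /* large automatic array; index order is reversed relative to the Fortran */
    double v2tmp[p6d][h2d][h3d][h7d];

    #pragma omp parallel for collapse(3) schedule(static)
    for (int p6 = 0; p6 < p6d; ++p6)
      for (int h7 = 0; h7 < h7d; ++h7)
        for (int h2 = 0; h2 < h2d; ++h2)
          for (int h3 = 0; h3 < h3d; ++h3)
            v2tmp[p6][h2][h3][h7] = v2sub[h7][p6][h2][h3];

    /* the triplesx update loop nest would follow here, analogous to the Fortran */
}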

@BlaineGarst

My FORTRAN education stopped at WATFOR in the early 70s, and my reading knowledge of OPENMP is even thinner, so...
... I think I see
double v2tmp[p6d][h2d][h3d][h7d]; // row column reversed vs FORTRAN

So a couple of small points. Stack arrays of this size are not common in my non-HPC experience; one has to use pthread_create with options to spawn threads with huge stacks, and presumably that is what the CPLEX runtime would do.
Second, well, wouldn't something like
#define V2TMP(a,b,c,d) v2sub[b][c][d][a]
be correct or am I missing some secret aliasing of v2sub to triplesx by the caller?

The major point you are trying to make, I think, is that there will be cases where large automatic arrays prove useful. So in the "make simple things simple, difficult things possible" design sense, wouldn't explicitly forcing the array to decay to a pointer work here? If it's common enough, it might be worth extra syntax:
_Task_parallel _CaptureAsReference(v2tmp) for ( ...

My view, obviously, is that safety should be the default (const capture) and that riskier behaviors, such as concurrent initialization of v2tmp, should be asserted parallel-correct by the programmer through the extra syntax. The extra syntax isn't strictly necessary, since the programmer could hand-roll the address capture, but it's less eloquent for sure.
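
For concreteness, the hand-rolled capture might look roughly like this (illustrative only: _Task_parallel and _Capture are proposed syntax under discussion, and the loop body is elided to a comment):

double v2tmp[p6d][h2d][h3d][h7d];
_Task_parallel for _Capture(pv2tmp = &v2tmp[0])   /* capture an explicit pointer, not the array */
    (int p6 = 0; p6 < p6d; ++p6)
{
    /* pv2tmp[p6][h2][h3][h7] is modifiable; each iteration writes a disjoint
       slice of v2tmp, so the programmer is asserting there is no race */
}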

@jeffhammond
Member

My FORTRAN education stopped at WATFOR in the early 70s, and my reading knowledge of OPENMP is even thinner, so...
... I think I see
double v2tmp[p6d][h2d][h3d][h7d]; // row column reversed vs FORTRAN

I will try to make the C implementation match the Fortran one. Then I can try to convert the OpenMP to CPLEX to see how things look.

The code right now doesn't need tasking, but it would if the data were block-sparse. Currently, NWChem expresses block sparsity at a higher level in the code than what you see in my example.

So a couple of small points. Stack arrays of this size are not common in my non-HPC experience; one has to use pthread_create with options to spawn threads with huge stacks, and presumably that is what the CPLEX runtime would do.
Second, well, wouldn't something like
#define V2TMP(a,b,c,d) v2sub[b][c][d][a]
be correct or am I missing some secret aliasing of v2sub to triplesx by the caller?

The point is to change the layout in physical memory so that subsequent accesses are fast. Alignment and contiguous access are important for vectorization.

The major point you are trying to make, I think, is that there will be cases where large automatic arrays prove useful. So in the "make simple things simple, difficult things possible" design sense, wouldn't explicitly forcing the array to decay to a pointer work here? If it's common enough, it might be worth extra syntax:
_Task_parallel _CaptureAsReference(v2tmp) for ( ...

It depends. Will compilers be able to preserve the information about the multidimensional array structure in the auto-vectorization passes with decay-to-pointer?

My view, obviously, is that safety should be the default (const capture) and that riskier behaviors, such as concurrent initialization of v2tmp, should be asserted parallel-correct by the programmer through the extra syntax. The extra syntax isn't strictly necessary, since the programmer could hand-roll the address capture, but it's less eloquent for sure.

I agree with this philosophy. Let me think about how to implement what I want a bit more in CPLEX and see how it looks.

@nelsc
Contributor Author

nelsc commented Dec 17, 2015

I suggest that we leave static (global) variables as they are, and const capture automatics.

If the rationale is to make races less likely, it would make sense to treat globals differently only if we knew that uses of them were less likely to be involved in races than uses of autos.

Do we know that? If so, how? If not, what is the rationale for treating them differently?

@BlaineGarst

(jeff) So v2tmp is purely a performance "hack" (in a good sense). I don't doubt that for massive data sets this can be a critical win, and it can be supported in my proposal by manually decaying the array to a pointer, or by having syntax for that.

(clark) The race question is important: the current proposal would allow non-atomic automatic variables to be shared and mutated across iterations and back to the originating stack frame; such code might seem to work when there isn't much parallelism, and might mostly work in fact. I am concerned about non-mutating uses as well.

By const capturing them we put a copy near other captures, possibly improving locality, and we put a huge stake in the ground: anyone who tries to mutate them gets a compiler error. It helps teach the issues.

Globals in a more general programming environment than HPC are already known to have race issues and are rarely (in my non-HPC experience) used in library or other multi-threaded code. The HPC use cases are different in that globals, especially coming from a FORTRAN background, are used to hold the primary data. Their race issues are very well known, and since CPLEX requires that the parallel code not induce a race anyway, copying them seems counter-intuitive.

It turns out that Apple's Blocks don't copy globals, so a similar conceptual treatment leads to fewer concepts to learn across parallel and, potentially, concurrent code with closures. A weird issue that came up was whether function calls, which are obviously pointer references to global code, should be const-copied; that led in part to the decision not to capture globals. And in the years since it shipped there have been no complaints about not capturing a global, but then, we didn't have _Atomic.

So to some degree I say that yes, the HPC experience of using non-atomic globals safely is relevant field experience.

@nelsc
Contributor Author

nelsc commented Dec 18, 2015

I am concerned about non-mutating uses as well.

Lacking any explanation of the nature of your concern, I don't know what to do with this statement.

@phalpern

A few comments and observations:

  1. I strongly believe that outer-scope global and local variables should be treated identically. Anything else is overly complex and confusing.
  2. Local arrays can be important. With a sufficiently light-weight parallel runtime system, an array of 20 elements can absolutely benefit from parallelization.
  3. If we treat all outer-scope variables as const, then common parallel code that modifies independent array elements will not work without extra work.
  4. The extra work in the previous item might be expressible as a simple _Capture. E.g., for array data, you could force pointer decay using _Capture(data=&data[0]), as in the sketch below. Using this approach for parallel loops would require adding a _Capture clause to loops, which is not currently in the WP.
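
A sketch of item 4 in that proposed form (illustrative only, since the _Capture clause on loops is not in the WP; g is a hypothetical function of the loop index):

float data[N];
_Task_parallel for _Capture(data = &data[0])
    (int i = 0; i < N; ++i)
{
    data[i] = g(i);  /* inside the loop, data is the captured pointer; each
                        iteration writes a distinct element, so no race */
}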
