
should uses of outer-scope names from parallel code be restricted? #11

Open
nelsc opened this issue Dec 8, 2015 · 10 comments

@nelsc
Contributor

nelsc commented Dec 8, 2015

No description provided.

@nelsc
Contributor Author

nelsc commented Dec 8, 2015

If the purpose of this is to do something about races, then what about races on objects accessed indirectly?

@phalpern

The idea here is to reduce the chance of creating races. For example:

int x = 0;
_Task_parallel for (int i = 0; i < N; ++i)
    x += f(i);  // Error: x is const within the parallel for loop

The use of x above would be a race. We should be using a _Reduction for x. If modification is really needed, an explicit address can be used (indirect accesses are not prevented from racing, as Clark observed):

int x = 0;
_Atomic int y = 0;
float A[N];
_Task_parallel for _Capture(py=&y, pA=&A[0])
    (int i = 0; i < N; ++i)
{
    *py += f(i);   // OK, *py is a modifiable atomic
    pA[i] += g(i); // OK, pA[i] is modifiable and unique
}

One disadvantage of the above is that expressions like A[i] = expr are very common in parallel code, where A[i] is known not to race with other loop iterations. We might make a special case for arrays -- effectively treating them as decaying to pointers.

Another disadvantage is that no parallel language that I'm aware of has this restriction. I don't know of any usage experience. It certainly deviates from both Cilk and OpenMP.
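
For comparison, this accumulation is commonly written today in OpenMP C with a reduction clause; the following is a minimal illustrative sketch (not CPLEX syntax), assuming the same hypothetical f as above:

int f(int);   /* hypothetical: the same f as in the example above */

int sum_f(int N)
{
    int x = 0;
    /* each thread accumulates into a private copy of x; the private copies
       are combined with + at the end of the loop, so there is no race on x */
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < N; ++i)
        x += f(i);
    return x;
}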

@BlaineGarst

Stack-based arrays are rarely large enough to merit parallel computation; most of the time one is dealing with allocated storage of some kind or, sure, static (global) storage.

I suggest that we leave static (global) variables as they are, and const capture automatics. They will rarely be arrays of significant size. Arrays of size 4 are interesting for graphics as point multipliers etc., but those are small enough to be const captured. (Please note that since they would be CONST they could be shared among the worker threads. This happens with Apple's Blocks.)
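
To illustrate the const-capture case with a small array (a sketch in the proposed syntax used elsewhere in this thread; scale, in, out, and N are hypothetical names):

const float scale[4] = { 1.0f, 1.0f, 1.0f, 0.5f };  /* small point-multiplier array */
_Task_parallel for (int i = 0; i < N; ++i)
    out[i] = in[i] * scale[i % 4];  /* read-only use: a const-captured copy of scale
                                       can safely be shared among the worker threads */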

@jeffhammond
Member

What if my parallel computation involves reading from a thousand-element automatic (i.e. stack-based) array and a million-element heap array while writing to a thousand-element automatic array? Is that going to be supported with this restriction?

Below is an excerpt from a widely used HPC application (NWChem; the associated file is ccsd_t_kernels_omp.F) in which I allocate a four-dimensional automatic array whose dimensions are usually around 30 (i.e. about 6.48 MB). You can see that my current Fortran implementation computes on this array inside of OpenMP regions.

      subroutine sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,h7d,
     2                     triplesx,t2sub,v2sub)
      IMPLICIT NONE
      integer h3d,h2d,h1d,p6d,p5d,p4d,h7d
      integer h3,h2,h1,p6,p5,p4,h7
      double precision triplesx(h3d,h2d,h1d,p6d,p5d,p4d)
      double precision t2sub(h7d,p4d,p5d,h1d)
      double precision v2sub(h3d,h2d,p6d,h7d)
      double precision v2tmp(h7d,h3d,h2d,p6d)
!$omp  parallel do collapse(3)
!$omp& default(shared) schedule(static)
!$omp& private(h1,h2,h3,p4,p5,p6,h7)
      do p6=1,p6d
      do h7=1,h7d
      do h2=1,h2d
!dec$ vector always nontemporal
!DIR$ IVDEP
      do h3=1,h3d
        v2tmp(h7,h3,h2,p6) = v2sub(h3,h2,p6,h7)
      enddo
      enddo
      enddo
      enddo
!$omp  parallel do collapse(OMP_COLLAPSE_LEVEL)
!$omp& default(shared) schedule(static)
!$omp& private(h1,h2,h3,p4,p5,p6,h7)
      do p5=1,p5d
      do p6=1,p6d
      do p4=1,p4d
      do h1=1,h1d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
      do h2=1,h2d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
      do h3=1,h3d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
!dir$ simd
!DIR$ IVDEP
      do h7=1,h7d
       triplesx(h3,h2,h1,p6,p5,p4)=triplesx(h3,h2,h1,p6,p5,p4)
     1  -t2sub(h7,p4,p5,h1)*v2tmp(h7,h3,h2,p6)
      enddo
      enddo
      enddo
      enddo
      enddo
      enddo
      enddo
!$omp end parallel do
      return
      end

This is but one type of computation where automatic arrays are large enough to merit parallel computation.

And yes, I really do want to be able to write the same code in pure C some day. I have already started converting these functions to C in ccsd_t_kernels_f2c.c. I'd like to not have to use OpenMP if possible.
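
To give a sense of what that conversion might look like, here is a minimal C sketch of the first loop nest (the v2tmp permutation) using C99 variable-length arrays and the existing OpenMP directive; the function name is hypothetical, v2tmp stays a large automatic array as in the Fortran, and a CPLEX version would replace the pragma:

void sd_t_d1_1_permute(int h3d, int h2d, int p6d, int h7d,
                       const double v2sub[h7d][p6d][h2d][h3d])
{
    /* large automatic array; index order is reversed relative to the Fortran */
    double v2tmp[p6d][h2d][h3d][h7d];

    #pragma omp parallel for collapse(3) schedule(static)
    for (int p6 = 0; p6 < p6d; ++p6)
      for (int h7 = 0; h7 < h7d; ++h7)
        for (int h2 = 0; h2 < h2d; ++h2)
          for (int h3 = 0; h3 < h3d; ++h3)
            v2tmp[p6][h2][h3][h7] = v2sub[h7][p6][h2][h3];

    /* the triplesx update loop nest would follow here, analogous to the Fortran */
}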

@BlaineGarst

My FORTRAN education stopped at WATFOR in the early 70s, and my reading knowledge of OPENMP is even thinner, so...
... I think I see
double v2tmp[p6d][h2d][h3d][h7d]; // row column reversed vs FORTRAN

So a couple of small points. Stack arrays of this size are not common in my non-HPC experience; one has to use pthread_create with options to spawn threads with huge stacks, and presumably that is what the CPLEX runtime would do.
Second, well, wouldn't something like
#define V2TMP(a,b,c,d) v2sub[b][c][d][a]
be correct or am I missing some secret aliasing of v2sub to triplesx by the caller?

The major point you are trying to make, I think, is that there will be cases where large automatic arrays prove useful. So in the "make simple things simple, difficult things possible" design sense, wouldn't explicitly forcing the array to decay to a pointer work here? If it's common enough, it might be worth extra syntax:
_Task_parallel _CaptureAsReference(v2tmp) for ( ...

My view, obviously, is that safety should be the default (const capture) and that riskier behaviors, such as concurrent initialization of v2tmp, should be asserted parallel-correct by the programmer through the extra syntax. The extra syntax isn't strictly necessary, since the programmer could hand-roll the address capture, but it's less eloquent for sure.
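
For concreteness, the hand-rolled capture might look roughly like this (illustrative only: _Task_parallel and _Capture are proposed syntax under discussion, and the loop body is elided to a comment):

double v2tmp[p6d][h2d][h3d][h7d];
_Task_parallel for _Capture(pv2tmp = &v2tmp[0])   /* capture an explicit pointer, not the array */
    (int p6 = 0; p6 < p6d; ++p6)
{
    /* pv2tmp[p6][h2][h3][h7] is modifiable; each iteration writes a disjoint
       slice of v2tmp, so the programmer is asserting there is no race */
}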

@jeffhammond
Member

My FORTRAN education stopped at WATFOR in the early 70s, and my reading knowledge of OPENMP is even thinner, so...
... I think I see
double v2tmp[p6d][h2d][h3d][h7d]; // row column reversed vs FORTRAN

I will try to make the C implementation match the Fortran one. Then I can try to convert the OpenMP to CPLEX to see how things look.

The code right now doesn't need tasking, but it would if the data were block-sparse. Currently, NWChem expresses block sparsity at a higher level in the code than what you see in my example.

So a couple of small points. Stack arrays of this size are not common in my non-HPC experience; one has to use pthread_create with options to spawn threads with huge stacks, and presumably that is what the CPLEX runtime would do.
Second, well, wouldn't something like
#define V2TMP(a,b,c,d) v2sub[b][c][d][a]
be correct or am I missing some secret aliasing of v2sub to triplesx by the caller?

The point is to change the layout in physical memory so that subsequent accesses are fast. Alignment and contiguous access are important for vectorization.

The major point you are trying to make, I think, is that there will be cases where large automatic arrays prove useful. So in the "make simple things simple, difficult things possible" design sense, wouldn't explicitly forcing the array to decay to a pointer work here? If it's common enough, it might be worth extra syntax:
_Task_parallel _CaptureAsReference(v2tmp) for ( ...

It depends. Will compilers be able to preserve the information about the multidimensional array structure in the auto-vectorization passes with decay-to-pointer?

My view, obviously, is that safety should be the default (const capture) and that riskier behaviors, such as concurrent initialization of v2tmp, should be asserted parallel-correct by the programmer through the extra syntax. The extra syntax isn't strictly necessary, since the programmer could hand-roll the address capture, but it's less eloquent for sure.

I agree with this philosophy. Let me think about how to implement what I want a bit more in CPLEX and see how it looks.

@nelsc
Contributor Author

nelsc commented Dec 17, 2015

I suggest that we leave static (global) variables as they are, and const capture automatics.

If the rationale is to make races less likely, it would make sense to treat globals differently only if we knew that uses of them were less likely to be involved in races than uses of autos.

Do we know that? If so, how? If not, what is the rationale for treating them differently?

@BlaineGarst

(jeff) So v2tmp is purely a performance "hack" (in a good sense). I don't doubt that for massive data sets this can be a critical win, and it can be supported in my proposal by manually decaying the array to a pointer, or by having syntax for that.

(clark) The race question is important: the current proposal would allow non-atomic automatic variables to be shared and mutated across iterations and back to the originating stack frame; such code might seem to work when there isn't much parallelism, and might mostly work in fact. I am concerned about non-mutating uses as well.

By const capturing them we put a copy near other captures, possibly improving locality, and we put a huge stake in the ground: anyone who tries to mutate them gets a compiler error. It helps teach the issues.

Globals in a more general programming environment than HPC are already known to have race issues and are rarely (in my non-HPC experience) used in library or other multi-threaded code. The HPC use cases are different in that globals, especially coming from a FORTRAN background, are used to hold the primary data. Their race issues are very well known, and since CPLEX requires that the parallel code not induce a race anyway, copying them seems counter-intuitive.

It turns out that Apple's Blocks don't copy globals, so a similar conceptual treatment leads to fewer concepts to learn across parallel and, potentially, concurrent code with closures. A weird issue that came up was whether function calls, which are obviously pointer references to global code, should be const-copied; that led in part to the decision not to capture globals. And in the years since it shipped there have been no complaints about not capturing a global, but then, we didn't have _Atomic.

So to some degree I say that yes, the HPC experience of using non-atomic globals safely is relevant field experience.

@nelsc
Contributor Author

nelsc commented Dec 18, 2015

I am concerned about non-mutating uses as well.

Lacking any explanation of the nature of your concern, I don't know what to do with this statement.

@phalpern

A few comments and observations:

  1. I strongly believe that outer-scope global and local variables should be treated identically. Anything else is overly complex and confusing.
  2. Local arrays can be important. With a sufficiently light-weight parallel runtime system, an array of 20 elements can absolutely benefit from parallelization.
  3. If we treat all outer-scope variables as const, then common parallel code that modifies independent array elements will not work without extra work.
  4. The extra work in the previous item might be expressible as a simple _Capture. E.g., for array data, you could force pointer decay using _Capture(data=&data[0]), as in the sketch below. Using this approach for parallel loops would require adding a _Capture clause to loops, which is not currently in the WP.
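
A sketch of item 4 in that proposed form (illustrative only, since the _Capture clause on loops is not in the WP; g is a hypothetical function of the loop index):

float data[N];
_Task_parallel for _Capture(data = &data[0])
    (int i = 0; i < N; ++i)
{
    data[i] = g(i);  /* inside the loop, data is the captured pointer; each
                        iteration writes a distinct element, so no race */
}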
