Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defect: same-image copies incorrect with teams #632

Closed
nathanweeks opened this issue Feb 9, 2019 · 10 comments · Fixed by #633
Closed

Defect: same-image copies incorrect with teams #632

nathanweeks opened this issue Feb 9, 2019 · 10 comments · Fixed by #633

Comments

@nathanweeks
Copy link
Contributor

Defect/Bug Report

  • OpenCoarrays Version: 2.4.0-6-gebc6f30
  • Fortran Compiler: gfortran 8.2.0
  • C compiler used for building lib: gcc 8.2.0
  • Installation method: CC=gcc FC=gfortran cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=...
  • Output of uname -a: Linux ... 4.4.156-94.61.1.16335.0.PTF.1107299-default tests dis_transpose: test passed  #1 SMP ... x86_64 x86_64 x86_64 GNU/Linux
  • MPI library being used: MPICH 3.3
  • Machine architecture and number of physical cores:
  • Version of CMake: 3.11.4

Observed Behavior

*caf_get(), *caf_send(), and *caf_sendget() detect when the cosubscript(s) of the coarray(s) refer to the same image as the executing image; in that case, an optimized copy (avoiding MPI) is performed. However, this detection can be incorrect when the image indexes of the current team don't correspond to the cobounds of the coarray(s), leading to incorrect results.

Expected Behavior

Steps to Reproduce

The following examples form teams that have one image per team.

for *caf_get():

program teams_coarray_get
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: team
  integer, allocatable :: L(:)
  integer :: i, R[*]

  allocate(L(num_images()))
  R = this_image()

  form team (this_image(), team)
  change team (team)
    do i = 1, size(L)
      L(i) = R[i]
    end do
  end team

  write(*,*) this_image(), ':', L
end program teams_coarray_get

With 3 images, all images are expected to have L = [1,2,3]; however, only image 1 does:

$ caf teams_coarray_get.f90
$ cafrun -np 3 ./a.out
           1 :           1           2           3
           2 :           2           2           3
           3 :           3           2           3

*caf_send():

program teams_coarray_send
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: team
  integer, allocatable :: R(:)[:]
  integer :: i, initial_team_this_image

  allocate(R(num_images())[*], source=0)
  initial_team_this_image = this_image()

  form team (this_image(), team)
  change team (team)
    do i = 1, size(R)
      R(initial_team_this_image)[i] = initial_team_this_image
    end do
  end team

  write(*,*) this_image(), ':', R
end program teams_coarray_send

With 3 images, all are expected to have R = [1,2,3]; image 1's result is incorrect:

$ caf teams_coarray_send.f90
$ cafrun -n 3 ./a.out
           1 :           1           0           0
           2 :           1           2           3
           3 :           1           2           3

*caf_sendget():

program teams_coarray_sendget
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: team
  integer, allocatable :: R1(:,:)[:]
  integer :: i, j, R2[*]

  allocate(R1(num_images(), num_images())[*], source=0)
  R2 = this_image()

  form team (this_image(), team)
  change team (team)
    do concurrent (i = 1:ubound(R1,1), j = 1:ubound(R1,1))
      R1(R2,j)[i] = R2[j]
    end do
  end team

  write(*,*) this_image(), ':', R1
end program teams_coarray_sendget

On each image, R1 is expected to be a 3x3 matrix of the form:

1 2 3
1 2 3
1 2 3

("1 1 1 2 2 2 3 3 3" when output by the gfortran write(,) statement). However:

$ caf teams_coarray_sendget.f90
$ cafrun -np 3 ./a.out
           1 :           1           0           0           2           0           0           3           0           0
           2 :           1           2           3           2           2           2           3           3           3
           3 :           1           2           3           2           2           2           3           3           3
@nathanweeks
Copy link
Contributor Author

My bad: my examples violate the Fortran 2018 standard. Image index values specified in an image selector refer to the current team (unless the TEAM= or TEAM_NUMBER= specifier appear in an image selector, neither of which OpenCoarrays currently supports). Per Fortran 2018 (N2146 draft):

9.6 (3): If a TEAM= specifier appears in an image-selector, the team of the image selector is specified by team-value... If a TEAM_NUMBER= specifier appears in an image-selector and the current team is not the initial team, the value... Otherwise, the team of the image selector is the current team.
9.6 (4): An image selector shall specify an image index value that is not greater than the number of images in the team of the image selector, and identifies the image with that index in that team.

The same-image check in OpenCoarrays 2.4.0 should be valid as-is, and the PR is incorrect. But there is a bigger problem: image index values in image selectors appear to be interpreted not in the context of the current team, but in the team in which the coarray was allocated.

To illustrate, here's a (what I believe is a) standard-compliant modification of the above example code, which assigns odd-numbered images to team 1, and even-numbered images to team 2:

program teams_coarray_get
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: team
  integer, allocatable :: L(:)
  integer :: i, R[*]

  allocate(L(num_images()/2))
  R = this_image()

  form team (mod(this_image()-1,2)+1, team)
  change team (team)
    do i = 1, num_images()
      L(i) = R[i]
    end do
  end team

  write(*,*) this_image(), ':', L
end program teams_coarray_get
$ caf teams_coarray_get.f90
$ cafrun -np 4 ./a.out | sort -k 1n,1n
           1 :           1           2   
           2 :           2           2   
           3 :           1           3   
           4 :           1           4   

Expected result:

           1 :           1           3   
           2 :           2           4
           3 :           1           3
           4 :           2           4

Additional eyeballs to confirm/refute my interpretation of the Fortran 2018 standard & validity of the test case would be appreciated.

Assuming the above is accurate, it appears that the image index(es) specified in the image selector(s) must be translated from the MPI group of the current team to the MPI group of the corresponding coarray's MPI window the MPI_Put/MPI_Get. If there are no other volunteers, I think I should be able to implement this translation (caveat: time permitting) and submit an updated PR.

@zbeekman
Copy link
Collaborator

image index values in image selectors appear to be interpreted not in the context of the current team, but in the team in which the coarray was allocated.

Yes, this is definitely a bug that needs to be fixed! Thanks for catching this!

As such, I take it that the PR (#633) should NOT be merged?

I agree with your reading of the standard. Perhaps @rouson and or Michael can weigh in here to confirm?

@nathanweeks
Copy link
Contributor Author

@zbeekman : that's correct, PR #633 should not be merged. I'm in the process of fixing this issue for several routines (get, send, sendget, get_by_ref, send_by_ref, and sendget_by_ref) and adding simple unit tests. I should be able to finish this week.

Once I'm done, should I amend my previous commit in PR #633 (and force push), add a new commit, or submit a new PR? Furthermore, should this issue be closed and a new one submitted (since the actual issue appears to be the opposite of what I originally thought), or should the discussion continue in this issue (perhaps correcting the title)?

@nathanweeks
Copy link
Contributor Author

I'm running into an issue trying to implement this for sendget_by_ref, and I may be encountering a bug in the OpenCoarrays 2.5.0 implementation of sendget_by_ref:

program test_sendget_by_ref
  implicit none
  type :: rank1_type
    integer, allocatable :: A(:)
  end type
  type :: rank2_type
    integer, allocatable :: A(:,:)
  end type
  type(rank1_type) :: R_get[*]
  type(rank2_type) :: R_send[*]
  integer :: i, j

  allocate(R_get%A(this_image()), source=-1)
  R_get%A(this_image()) = this_image()

  allocate(R_send%A(num_images(),num_images()), source=-2)

  sync all

  do i = 1, num_images()
    do j = 1, num_images()
      R_send[i]%A(j,this_image()) = R_get[j]%A(j)
    end do
  end do

  sync all

  write(*,*) this_image(), ':', R_get%A, '|', R_send%A
end program test_sendget_by_ref

Output:

$ caf test_sendget_by_ref.f90
$ cafrun -np 3 ./a.out  | sort -k 1n,1n
           1 :           1 |           1          -2           2          -2           3          -2          -2          -2          -2
           2 :          -1           2 |           1          -2           2          -2           3          -2          -2          -2          -2
           3 :          -1          -1           3 |           1          -2           2          -2           3          -2          -2          -2          -2

I'm expecting the R_send%A array to be (for all images, after the "|" sign in the output):

1          2           3          1           2          3          1          2          3

Could someone double-check this example? If there is a problem with sendget_by_ref (and a fix isn't imminent), I could update PR #633 to fix the other aforementioned routines, and a new issue could be opened for sendget_by_ref.

nathanweeks added a commit to nathanweeks/OpenCoarrays that referenced this issue Mar 24, 2019
Translate image-selector image indices to be w.r.t. the current team.

Untested for sendget_by_ref due to possible issue described in sourceryinstitute#632.
@nathanweeks
Copy link
Contributor Author

I've updated PR #633 to reflect the interpretation of image-selector image index values in the context of teams as described in #632 (comment). I did what I think is the right thing for sendget_by_by_ref, but as per #632 (comment), I think there is a problem in sendget_by_ref as-is, and so those modifications are effectively untested.

@zbeekman zbeekman added this to Unconfirmed bugs in Bug squashing: Most wanted Mar 28, 2019
@stale
Copy link

stale bot commented Apr 21, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Apr 21, 2019
@zbeekman
Copy link
Collaborator

I did what I think is the right thing for sendget_by_by_ref, but as per #632 (comment), I think there is a problem in sendget_by_ref as-is, and so those modifications are effectively untested.

So just to make sure I'm following this:

  • sendget_by_ref was broken to begin with
  • Your modifications to sendget_by_ref may still be broken
  • You made a best effort to fix/implement sendget_by_ref correctly
  • There is no test coverage for sendget_by_ref still

After I merge PR #633, let's close this issue, assuming the teams problem you reported here is adequately addressed.

Can you then open a new issue describing any deficiencies/problems with sendget_by_ref that remain, even if it's just "I think it's broken and there is insufficient test coverage", for example.

Does that sound like a decent plan, and like I'm following things correctly?

@stale stale bot removed the stale label Apr 21, 2019
@ghost ghost removed the needs-review label Apr 21, 2019
@zbeekman
Copy link
Collaborator

@nathanweeks please see my previous comment and re-open/file new issues as appropriate. Thanks so much for your contribution to get this fixed!!!

@nathanweeks
Copy link
Contributor Author

@zbeekman : your understanding is correct; in particular, the sendget_by_ref issue is illustrated by #632 (comment). Per your advice, I'll open a new issue for it.

@zbeekman
Copy link
Collaborator

Great, thanks so much, Nathan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Bug squashing: Most wanted
  
Unconfirmed bugs
Development

Successfully merging a pull request may close this issue.

2 participants