Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior with co_reduce #172

Closed
floquet opened this issue Apr 16, 2016 · 43 comments
Closed

Unexpected behavior with co_reduce #172

floquet opened this issue Apr 16, 2016 · 43 comments

Comments

@floquet
Copy link

floquet commented Apr 16, 2016

The co_reduce collective appears to be posting random, large number to result_image using the reference example in the gfortran documentation. A diagnostic version of the code and results are included in Stack Overflow: Fortran coarray anomaly with co_reduce.

Sample code showing compilation and execution:
co_reduce.tar.gz

@afanfa
Copy link
Contributor

afanfa commented Apr 16, 2016

Thanks for reporting this. I've found the reason for this problem and I'm producing a patch (for OpenCoarrays). What I don't completely understand is why this problem hasn't come out earlier.

@rouson
Copy link
Member

rouson commented Apr 17, 2016

Presumably it hasn't appeared sooner because our current co_reduce unit test doesn't use the optional result_image argument. I'll add tests that include the result_image argument.

@afanfa
Copy link
Contributor

afanfa commented Apr 17, 2016

Unfortunately result_image is not the problem. The patch that solves this bug breaks the co_reduce test already part of the OpenCoarrays test suite.

@rouson
Copy link
Member

rouson commented Apr 17, 2016

Wow. Thanks for catching this. So if the patch breaks the current co_reduce test, does that mean the current co_reduce test is incorrect.

@afanfa
Copy link
Contributor

afanfa commented Apr 17, 2016

The compiler does something strange on the co_reduce test included in the test suite. I'm trying to understand why...

@afanfa
Copy link
Contributor

afanfa commented Apr 17, 2016

I created a new branch called bugfix-coreduce that includes the patch. I think this is the right way to do it but I don't have enough time to investigate further. The problem with the co_reduce test case is probably due to a compiler bug.

@floquet
Copy link
Author

floquet commented Apr 18, 2016

Thanks for making coarrrays a reality. Really appreciate the hard work.

On Sat, Apr 16, 2016 at 6:52 PM, Alessandro Fanfarillo <
notifications@github.com> wrote:

Thanks for reporting this. I've found the reason for this problem and I'm
producing a patch (for OpenCoarrays). What I don't completely understand is
why this problem hasn't come out earlier.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#172 (comment)

@floquet
Copy link
Author

floquet commented Apr 18, 2016

P.S. I hit the bug a about two months ago, but was convinced the problem
was my programming or my machine.

On Sat, Apr 16, 2016 at 6:52 PM, Alessandro Fanfarillo <
notifications@github.com> wrote:

Thanks for reporting this. I've found the reason for this problem and I'm
producing a patch (for OpenCoarrays). What I don't completely understand is
why this problem hasn't come out earlier.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#172 (comment)

@rouson
Copy link
Member

rouson commented May 24, 2016

@afanfa, any updates?

@afanfa
Copy link
Contributor

afanfa commented Nov 5, 2016

No updates. The issue is related to a compiler bug.

@rouson
Copy link
Member

rouson commented Nov 6, 2016

@afanfa, in that case, I assume it's for someone else to work on under contract. If so, we'll try to contract out the work to someone. If I'm incorrect and it's something you have time to work on, let us know. No pressure.

@afanfa
Copy link
Contributor

afanfa commented Nov 7, 2016

I guess so. Basically, the master branch has an implementation of co_reduce compatible with the code produced by the compiler for the co_reduce test currently included in the test suite. For the code proposed by @floquet (simpler than the one in the test suite) the master branch implementation produces wrong results. I looked into this bug long time ago; I would recommend to check it again before getting someone to work on this.

@zbeekman
Copy link
Collaborator

zbeekman commented Dec 2, 2016

@afanfa what needs to be done to resolve this? Is there any way I can help?

@vehre
Copy link
Collaborator

vehre commented Jan 9, 2017

While not seeing the issue before, I now get it also with gcc-6 and gcc-trunk.
Did something change, that I am not aware of?

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

While not seeing the issue before, I now get it also with gcc-6 and gcc-trunk.
Did something change, that I am not aware of?

@vehre Do you mean you see this behavior with the current unit tests, or do you have some other code that causes this behavior? Was it working for a given commit in the past? If so I can git bisect to find where it broke...

It looks like 14536e1 may fix this bug but @afanfa wanted to look into the issue some more....

Fixing this should definitely be prioritized, though. CC: @rouson @afanfa

@vehre
Copy link
Collaborator

vehre commented Jan 9, 2017

Is see the tescase co_reduce_test failing now with gcc-6 and gcc-trunk and on master of opencoarray. That wasn't so before.

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

@vehre: OK, I'll investigate and bisect assuming I can reproduce...

  • What OS are you on?
  • Bare metal or virtualized?
  • Number of physical or virtual cores?
  • How is MPI installed? (package manager vs install.sh, etc.)
  • Do you have an approximate idea of how long ago co_reduce_test was working correctly?

@vehre
Copy link
Collaborator

vehre commented Jan 9, 2017

  • Linux Fedora 25/x86_64 fully patched,
  • bare metal
  • 4
  • as rpm package
  • approx. 1 week ago I didn't notice it failing.

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

as rpm package

MPICH or OpenMPI?

@vehre
Copy link
Collaborator

vehre commented Jan 9, 2017

I have both on my system, but used MPICH to see the failure.
mpich-3.2-6.fc25.x86_64

zbeekman added a commit that referenced this issue Jan 19, 2017
 [issue #172](#172)
 causes co_reduce to return wrong results if the binary operator has
 arguments declared with the `value` attribute rather than
 `intent(in)`. Switching to use `intent(in)` is a valid work
 around. The binary operator is required to be a pure function with
 two arguments. As @LadaF points out, F2008 says:

 > C1276 The specification-part of a pure function subprogram shall
   specify that all its nonpointer dummy data objects have the
   `INTENT(IN)` or the `VALUE` attribute.

 Original coverage diff from adding this test was at:
 https://codecov.io/gh/sourceryinstitute/opencoarrays/compare/882c371d4d7e84364eb7adfba4b4f8d840e3f398...1a0e3b7edb6867d9e9371cbd9ed1d8f5b3dd2010/changes

 If this commit produces a different change in coverage then that
 likely indicates where there is a bug in the library. If the library
 coverage remains the same, then the bug is probably in the
 compiler. See discussion at
 [#172](#172)
zbeekman added a commit that referenced this issue Jan 19, 2017
 But configure it so that it passes and doesn't exhibit regression
zbeekman added a commit that referenced this issue Jan 19, 2017
 [issue #172](#172)
 causes co_reduce to return wrong results if the binary operator has
 arguments declared with the `value` attribute rather than
 `intent(in)`. Switching to use `intent(in)` is a valid work
 around. The binary operator is required to be a pure function with
 two arguments. As @LadaF points out, F2008 says:

 > C1276 The specification-part of a pure function subprogram shall
   specify that all its nonpointer dummy data objects have the
   `INTENT(IN)` or the `VALUE` attribute.

 Original coverage diff from adding this test was at:
 https://codecov.io/gh/sourceryinstitute/opencoarrays/compare/882c371d4d7e84364eb7adfba4b4f8d840e3f398...1a0e3b7edb6867d9e9371cbd9ed1d8f5b3dd2010/changes

 If this commit produces a different change in coverage then that
 likely indicates where there is a bug in the library. If the library
 coverage remains the same, then the bug is probably in the
 compiler. See discussion at
 [#172](#172)
zbeekman added a commit that referenced this issue Jan 19, 2017
 But configure it so that it passes and doesn't exhibit regression
@zbeekman
Copy link
Collaborator

The test I'm talking about isn't on the master branch yet. You can see it in PR #310 here: 2d685c2#diff-174a6f4266c23ced33d9405eca67795c

An alternate version that behaves correctly is here, also in PR #310: d7773ce#diff-c95b08ee3613c980a239d87e9edb4001

This variation in whether the binary operators arguments are intent(in) or value causes a difference in code coverage that I describe in the analysis above...

The coverage is combined from all tests... so it reporting the union of all lines hit when executing all tests (that get run) in src/tests from the OS X and Linux VM on Travis-CI

I'm not sure what your point about allreduce is... co_reduce calls MPI_allreduce in two places. All instances of code we are discussing call co_reduce which may or may not call MPI_allreduce.

I have rebased the branch and thus PR #310, so not all the commits are the same with the same hashes as they once had, but the old coverage data should still be accessible.

@vehre
Copy link
Collaborator

vehre commented Jan 20, 2017

I have forked another issue #317 to track the noexecstack issue and keep it separate from what is causing the intent(in)/value thing.

Pseudo-code signature of myprod() for intent(in) args a and b (for the code in #310 2d685c2#diff-174a6f4266c23ced33d9405eca67795c):

myprod (integer(kind=4) & restrict a, integer(kind=4) & restrict b)

and for value args:

myprod (integer(kind=4) restrict a, integer(kind=4) restrict b)

See the difference? In the intent(in) case the arguments are references (addresses to memory, pass by reference) in the value case they are the value itself (pass by value).

So the issue here is, that the library does not respect the flags passed to the collective function in opr_flags.

zbeekman added a commit that referenced this issue Jan 20, 2017
 [issue #172](#172)
 causes co_reduce to return wrong results if the binary operator has
 arguments declared with the `value` attribute rather than
 `intent(in)`. Switching to use `intent(in)` is a valid work
 around. The binary operator is required to be a pure function with
 two arguments. As @LadaF points out, F2008 says:

 > C1276 The specification-part of a pure function subprogram shall
   specify that all its nonpointer dummy data objects have the
   `INTENT(IN)` or the `VALUE` attribute.

 Original coverage diff from adding this test was at:
 https://codecov.io/gh/sourceryinstitute/opencoarrays/compare/882c371d4d7e84364eb7adfba4b4f8d840e3f398...1a0e3b7edb6867d9e9371cbd9ed1d8f5b3dd2010/changes

 If this commit produces a different change in coverage then that
 likely indicates where there is a bug in the library. If the library
 coverage remains the same, then the bug is probably in the
 compiler. See discussion at
 [#172](#172)
zbeekman added a commit that referenced this issue Jan 20, 2017
 But configure it so that it passes and doesn't exhibit regression
zbeekman added a commit that referenced this issue Jan 21, 2017
 [issue #172](#172)
 causes co_reduce to return wrong results if the binary operator has
 arguments declared with the `value` attribute rather than
 `intent(in)`. Switching to use `intent(in)` is a valid work
 around. The binary operator is required to be a pure function with
 two arguments. As @LadaF points out, F2008 says:

 > C1276 The specification-part of a pure function subprogram shall
   specify that all its nonpointer dummy data objects have the
   `INTENT(IN)` or the `VALUE` attribute.

 Original coverage diff from adding this test was at:
 https://codecov.io/gh/sourceryinstitute/opencoarrays/compare/882c371d4d7e84364eb7adfba4b4f8d840e3f398...1a0e3b7edb6867d9e9371cbd9ed1d8f5b3dd2010/changes

 If this commit produces a different change in coverage then that
 likely indicates where there is a bug in the library. If the library
 coverage remains the same, then the bug is probably in the
 compiler. See discussion at
 [#172](#172)
zbeekman added a commit that referenced this issue Jan 21, 2017
 But configure it so that it passes and doesn't exhibit regression
@zbeekman
Copy link
Collaborator

@vehre and @afanfa : 14536e1 fixes the code when value is present, however it breaks all the other co_reduce tests. Any idea how to get a fix for this that fixes this issue, but doesn't break the rest? I need to improve my C skills, I'm not quite there yet to be able to handle this myself...

zbeekman added a commit that referenced this issue Jan 22, 2017
Changes for resolving issues and release automation

 - Fixes #79
 - Fixes  #297 
 - Add regression test for #243 
 - Add regression for #172 (currently set to pass when test fails)
 - Migrates to use of a `.VERSION` file to make parsing easier for scripts and allow extra comments
 - Adds auto-upload of release assets upon tagging with git, but requires that the tag is PGP signed (`git tag -s <tag> [tree-ish]`)
 - This should compute the SHA 256 checksum and create a detached signature of the cryptographic SHA 256 checksum with an encrypted GPG subway I uploaded to the repo/travis. 🔮 🎩 🐇
@zbeekman zbeekman assigned vehre and unassigned afanfa Jan 31, 2017
vehre pushed a commit that referenced this issue Feb 8, 2017
Add support for recude-functions with value parameters.
Add support for char-arrays as arguments to reduce-functions.
Rename co_reduce_1 to internal_co_reduce to get a better naming.
Create mpi-datatype to transport the char-array length.

Fixes #172.
@ghost ghost added the needs-review label Feb 8, 2017
zbeekman added a commit that referenced this issue Feb 9, 2017
…pe_fix_172

Add value- and char-array support to co_reduce functions.

 - Fixes #172 
 - Fixes #324
@ghost ghost assigned zbeekman Feb 9, 2017
@ghost ghost removed the needs-review label Feb 9, 2017
@zbeekman
Copy link
Collaborator

zbeekman commented Feb 9, 2017

@floquet Fixed on master thanks to @vehre 🎉

@zbeekman zbeekman moved this from Unconfirmed GFortran modifications required to done in Bug squashing: Most wanted Feb 11, 2017
zbeekman added a commit that referenced this issue Apr 25, 2017
zbeekman added a commit to zbeekman/opencoarrays that referenced this issue Apr 26, 2017
@zbeekman zbeekman mentioned this issue Apr 26, 2017
20 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

6 participants