Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NVidia HPC SDK (PGI) #2475

Merged
merged 27 commits into from Sep 12, 2020
Merged

Support NVidia HPC SDK (PGI) #2475

merged 27 commits into from Sep 12, 2020

Conversation

andriish
Copy link
Contributor

@andriish andriish commented Sep 9, 2020

In this PR:

  • Support NVidia HPC SDK with PG compilers
  • Factor out CMake filter function
  • Allow CMake to build even if pytest is missing (still shows warning)

Closes #2214 and closes #2407.

.github/workflows/ci.yml Outdated Show resolved Hide resolved
Comment on lines 269 to 272
if (WIN32)
target_compile_options(${target} PRIVATE -Wc,--pending_instantiations=0 )
else()
target_compile_options(${target} PRIVATE -Wc,--pending_instantiations=0 -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these always required, or just for the tests? If always, maybe they should go into pybind11Common for pybind11 or some similar target.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-Wc,--pending_instantiations=0 is about the recursion level in templates. If I recall that correctly.
So, well, in some codes it will not be needed. But in the others the compilation will fail without this option.
Note that this option makes compilation significantly slower.

-D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 was deduced from experiments with old compilers on Linux.

tests/CMakeLists.txt Outdated Show resolved Hide resolved
@andriish
Copy link
Contributor Author

andriish commented Sep 9, 2020

Hi @henryiii ,

just to avoid misunderstanding, I will quote myself from the #2407
"I'm not sure if I will have any time to actually implement the full support, so the pull request is marked as work in progress.
Feel free to close it at any time.".
I mean that some things can be picked up from this merge request and might be useful. On some I can comment, the others would require some really dedicated work and changes in the build system. Not something that I have time for (even if I would like to in some cases). So most things are up you and other developers.

Best regards,
Andrii

Comment on lines 298 to 319
- 7 # GCC 4.8
- 8
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you are building twice, with identical compilers, just with a different base system? You aren't using the host GCC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @henryiii ,

There are differences, e.g. Python and cmake. But in general, I would consider CI to be an "environment test" rather than a test of some specific compiler version. The later has little practical value, as for me.

Best regards,
Andrii

@henryiii henryiii force-pushed the support_pgc branch 2 times, most recently from 6918c91 to 0a82626 Compare September 9, 2020 23:20
@henryiii
Copy link
Collaborator

henryiii commented Sep 9, 2020

Have you tried running the tests? It seems like many of them error out, mostly with things like "SystemError: Exception escaped from default exception translator!". Looks like this is pretty broken without this.

@henryiii
Copy link
Collaborator

henryiii commented Sep 10, 2020

Looked into PGI a bit more and realized we are probably making a fundamental mistake. Will fix tomorrow.

@andriish
Copy link
Contributor Author

Have you tried running the tests

Not the tests with exceptions. I see no hope for simple fixes for them.

we are probably making a fundamental mistake

I'm intrigued. Do you mean that it is not possible to support their implementation of stl? Or the approach is wrong?
Ok. Will just wait.

@henryiii
Copy link
Collaborator

I looked into PGI and it has complete C++11 support since 2015. So I then added #include <exception> (we do not run include-what-you-use, obviously) and that fixed all the issues with "missing" std::exception and such; PGI's stdlib is less "leaky" and we didn't get <exception> for free.

I re-enabled all the tests, and now only one test fails:

>       assert msg(capture) == strip_comments("""
            noisy new               # allocation required to attempt first overload
            noisy delete            # have to dealloc before considering factory init overload
            noisy new               # pointer factory calling "new", part 1: allocation
            NoisyAlloc(double 1.5)  # ... part two, invoking constructor
            ---
            ~NoisyAlloc()  # Destructor
            noisy delete   # operator delete
        """)
E       assert --- actual / +++ expected
E           noisy new
E         + noisy delete
E           noisy new
E           NoisyAlloc(double 1.5)
E           ---
E           ~NoisyAlloc()
E           noisy delete

../../tests/test_factory_constructors.py:358: AssertionError

@henryiii
Copy link
Collaborator

There are still quite a few warnings - we manually silence quite a few warnings for GCC/Clang, which obviously doesn't scale very well. I'm not sure how to manually silence them for PGI in-source.

@andriish
Copy link
Contributor Author

andriish commented Sep 10, 2020

are still quite a few warnings

          statement is unreachable
          return false;

Yes. That is why"more different compilers" is good for debug.

how to manually silence them for PGI in-source.

https://forums.developer.nvidia.com/t/disable-unreachable-warnings/133575

Unfortunately that is "all or nothing".

Best regards,

Andrii

Andrii Verbytskyi and others added 5 commits September 10, 2020 14:11
Added new CI config

Added new trigger

Changed CI workflow name

Debug CI

Debug CI

Debug CI

Debug CI

Added flags fro PGI

Disable Eigen

Removed tests that fail

Uncomment lines
fix: minor style cleanup

tests: support skipping

ci: remove and tighten a bit

fix: try msvc workaround for pgic
@henryiii henryiii changed the title [WIP] Support NVidia HPC SDK Support NVidia HPC SDK Sep 10, 2020
@henryiii henryiii marked this pull request as ready for review September 10, 2020 18:52
@henryiii
Copy link
Collaborator

I left the verbose in, wasn't too long. I've reordered that test and now pytest is no longer a FATAL_ERROR, so you can build without it.

@andriish
Copy link
Contributor Author

Can you check the file exception locally?

been there.
Literally the code from https://en.cppreference.com/w/cpp/error/exception_ptr

nvc++ 1.cc 
"1.cc", line 6: error: namespace "std" has no member "exception_ptr"
   void handle_eptr(std::exception_ptr eptr) // passing by value is ok
          

Also, grepping the nvidia installation shows that "exception_ptr" is located only in the libstdc++.ipl

@henryiii
Copy link
Collaborator

I think it's getting it from libstdc++ on your system (GCC 4.8), not from the install.

@andriish
Copy link
Contributor Author

I mean the definition. There is just one header with exception_ptr definition and that definition is behind very funny condition expressions.

@andriish
Copy link
Contributor Author

Could you, please try this code locally on CentOS7 with and w/o defines?

#if defined(__PGIC__) and defined(__GXX_EXPERIMENTAL_CXX0X__) and (__cplusplus >= 201103L)
#ifndef __GCC_ATOMIC_TEST_AND_SET_TRUEVAL
#define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
#endif
#if !defined(ATOMIC_INT_LOCK_FREE)
#define ATOMIC_INT_LOCK_FREE 3
#include <bits/exception_ptr.h>
#undef ATOMIC_INT_LOCK_FREE
#endif
#endif


#include <iostream>
#include <string>
#include <exception>
#include <stdexcept>

void handle_eptr(std::exception_ptr eptr) // passing by value is ok
{
    try {
        if (eptr) {
            std::rethrow_exception(eptr);
        }
    } catch(const std::exception& e) {
        std::cout << "Caught exception \"" << e.what() << "\"\n";
    }
}

int main()
{
    std::exception_ptr eptr;
    try {
        std::string().at(1); // this generates an std::out_of_range
    } catch(...) {
        eptr = std::current_exception(); // capture
    }
    handle_eptr(eptr);
} // destructor for std::out_of_range called here, when the eptr is destructed

@andriish
Copy link
Contributor Author

with -std=c++11

@henryiii
Copy link
Collaborator

I was able to reproduce locally (in docker). Very weird, no idea why this is different compared to docker on GHA! Try this:

cmake3 -S pybind11 -B build -DDOWNLOAD_CATCH=ON \
                              -DCMAKE_CXX_STANDARD=11 \
                             -DPYTHON_EXECUTABLE=$(python3 -c "import sys; print(sys.executable)") \
                             -DCMAKE_CXX_FLAGS="-Wc,--pending_instantiations=0 -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -D__GCC_ATOMIC_INT_LOCK_FREE=3" \
                             -DPYBIND11_TEST_FILTER="test_smart_ptr.cpp;test_virtual_functions.cpp"

@henryiii
Copy link
Collaborator

You are running natively, right (not containerized)?

@andriish
Copy link
Contributor Author

are running natively, right

Yes. Always! (no, sometimes in VB).

Ok, now more instanity. PGI 19.10 works perfectly.

@henryiii
Copy link
Collaborator

That compiles for me. Is "3" even a valid number for that? https://en.cppreference.com/w/c/atomic/ATOMIC_LOCK_FREE_consts

@andriish
Copy link
Contributor Author

Yes. Basically the <bits/exception_ptr.h> has condition bla-bla>2.

@henryiii
Copy link
Collaborator

I think PGI 20.7 doesn't support GCC 4.8 (even though it probably should if they are providing a repo for it, I bet it can be filed as a bug). The compiler is supposed to define that, and I think that's been dropped in 20.7. Except when running in GHA, in which case it magically doesn't drop that. That's about where I am. :S

@henryiii
Copy link
Collaborator

In short, since there is a command line workaround, I don't think we should hack the source. Probably report it as a bug with 20.7 and CentOS 7 and move on.

@henryiii
Copy link
Collaborator

henryiii commented Sep 11, 2020

We should probably link to this PR in somewhere, so if it breaks we can add the workaround to the CI.

@andriish
Copy link
Contributor Author

andriish commented Sep 11, 2020

magically

Container magic as is.

Ok. Agree to your suggestions. Basically to refine them:

  • put these flags into the cmake file with condition-specific to this compiler major and minor version

  • fill a bug report to Nvidia. I will do this and put you to cc.

  • Merge if you are ok with the changes. That is totally up to you.

What is your opinion?

Best regards,

Andrii

@henryiii
Copy link
Collaborator

I would not add the changes to CMake. Maybe put a note in. But this is a user workaround that has nothing to do with pybind11 (it's needed to compile the example code from cppref above), and it's OS (probably stdlib) + compiler specific. Just a note in the changlog, perhaps, saying this is supported but on old systems, you may need the workaround in issue #2475.

@henryiii
Copy link
Collaborator

(I'll try this locally on CentOS 8 later tonight)

@andriish
Copy link
Contributor Author

But this is a user workaround

Yes, so that should not appear in the code, but it is ok to put it in the test suite.

I also think this is a pure bug, not an "unsupported version".

@henryiii
Copy link
Collaborator

We don't understand the details of the bug, what versions it affects, or why the workaround is needed/what it affects. Hardcoding it into the test suite hides that the workaround is needed. Let's leave it as a note. (Unless another maintainer feels otherwise).

@andriish
Copy link
Contributor Author

Ok. Note is ok. One ca also issue a message during cmake configuration in case pgi is detected.

@henryiii
Copy link
Collaborator

message during cmake configuration in case pgi is detected

Once we understand it better, that's a possibility.

@henryiii
Copy link
Collaborator

When I did this by hand, I got, in CentOS 8:

"/usr/include/c++/8/bits/atomic_base.h", line 210: error: identifier
          "__GCC_ATOMIC_TEST_AND_SET_TRUEVAL" is undefined
      { return __i ? __GCC_ATOMIC_TEST_AND_SET_TRUEVAL : 0; }

(Again, this is a problem out of the box, not with pybind11).

Looking here: https://forums.developer.nvidia.com/t/support-for-atomic-in-libstdc-missing/135403/4

It looks a bit to me like that it was rather designed for an HPC admin to create a siterc that has defines like this that describe the HPC hardware that you will be targeting (since you compile on a host node, not a compute node, in general, and they may not be homogenous). That also is likely why it uses environment-modules; this is really not intended for "normal" users to drop in and go. The docker container is set up properly, but the rest is "standard PGI" configuration that you have to do.

(I could totally be wrong here. I also rather don't think they are targeting normal users yet, just HPC clusters)

@henryiii henryiii merged commit 38370a8 into pybind:master Sep 12, 2020
@henryiii
Copy link
Collaborator

Let me know / CC me on the issue with NVIDIA! For now, this should now be better / compilable via flags or site configuration instead of source changes.

@henryiii
Copy link
Collaborator

The rpm repo seems to be a little unstable. If it causes build to fail occasionally in the long run, I'll move to the container for the main test suite and make the two CentOS builds weekly or something like that.

@YannickJadoul
Copy link
Collaborator

Congrats on getting this to work, @henryiii & @andriish! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request] Support for Nvidia (former Portland Group) compilers PGI compiler issues
4 participants