Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use mamba as the conda-build solver via boa's mambabuild. #9953

Closed
wants to merge 4 commits into from

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Dec 22, 2021

This switches from conda build to conda mambabuild, a command provided by boa. The goal is to speed up CI build times by spending less time with conda solving environments.

Below, I include a summary of the CI build time changes I observed. Overall this cuts our CI build time substantially (34 minutes to 27 minutes, roughly a 20% reduction).

Summary of boa from its README:

boa is a package builder for conda packages. It largely re-uses the conda-build infrastructure, except for some parts. For example the 'solving stage' which, in Boa, is done using mamba, the fast conda-alternative. Learn more about mamba here.

@bdice bdice added gpuCI 5 - DO NOT MERGE Hold off on merging; see PR for details improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 22, 2021
@bdice bdice self-assigned this Dec 22, 2021
@codecov
Copy link

codecov bot commented Dec 22, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@0e74fca). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head adb6667 differs from pull request most recent head 1ed7c7c. Consider uploading reports for the commit 1ed7c7c to get more accurate results

@@               Coverage Diff               @@
##             branch-22.08    #9953   +/-   ##
===============================================
  Coverage                ?   10.42%           
===============================================
  Files                   ?      119           
  Lines                   ?    20599           
  Branches                ?        0           
===============================================
  Hits                    ?     2148           
  Misses                  ?    18451           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e74fca...1ed7c7c. Read the comment docs.

@bdice
Copy link
Contributor Author

bdice commented Dec 22, 2021

Comparing the CI build logs, I see a significant improvement in build time.

CI Log Before (conda build) After (conda mambabuild)
Total Runtime 34 min 27 min
conda build/mambabuild libcudf "Total time" 0:15:28.7 0:13:22.1
conda build/mambabuild libcudf_kafka "Total time" 0:04:15.0 0:01:32.1
conda build/mambabuild libcudf_example "Total time" 0:03:01.6 0:00:58.2
Finalizing metadata libcudf1 0:08:43 0:07:01
Finalizing metadata libcudf_kafka1 0:01:28 0:00:21
Finalizing metadata libcudf_example1 0:01:16 0:00:18

1 Time measured from INFO:conda_build.metadata:Attempting to finalize metadata for libcudf to BUILD START: ['libcudf-22.02.00a-cuda11_g17b1dd4476_213.tar.bz2'] -- there are some environments solved in this section of the log, which I don't think are included in the "Total time."

@bdice bdice removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Dec 22, 2021
@bdice bdice changed the title [TST] Use mamba as the conda-build solver via boa's mambabuild. Use mamba as the conda-build solver via boa's mambabuild. Dec 22, 2021
@firestarman
Copy link
Contributor

firestarman commented Dec 29, 2021

The failure in CI is due to the timeout during conda libraries installation.

16:18:49 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
16:58:49 Solving environment: ...working... Build timed out (after 40 minutes). Marking the build as failed.
16:58:50 Build was aborted
16:58:50 Recording test results
16:58:51 [Checks API] No suitable checks publisher found.
16:58:51 [Set GitHub commit status (universal)] ERROR on repos [GHRepository@2f92ba94[nodeId=MDEwOlJlcG9zaXRvcnk5MDUwNjkxOA==,description=cuDF - GPU DataFrame Library 

@bdice
Copy link
Contributor Author

bdice commented Dec 29, 2021

The failure in CI is due to the timeout during conda libraries installation.

16:18:49 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
16:58:49 Solving environment: ...working... Build timed out (after 40 minutes). Marking the build as failed.
16:58:50 Build was aborted
16:58:50 Recording test results
16:58:51 [Checks API] No suitable checks publisher found.
16:58:51 [Set GitHub commit status (universal)] ERROR on repos [GHRepository@2f92ba94[nodeId=MDEwOlJlcG9zaXRvcnk5MDUwNjkxOA==,description=cuDF - GPU DataFrame Library 

@firestarman Correct. The Java CI was already failing with that timeout error, and this PR does not fix the issue in the Java build. I tried to fix the Java CI in #9951 by switching to mamba but it ran into some issues finding the installed Java package, so I left the Java CI alone in that PR - I only fixed the other CI builds. I think the fix for Java CI will require using mamba and some manipulation of environment variables or CMake config. I haven’t looked closely yet.

See here: #9951 (comment)

@bdice bdice marked this pull request as ready for review January 4, 2022 15:59
@bdice bdice requested a review from a team as a code owner January 4, 2022 15:59
@bdice
Copy link
Contributor Author

bdice commented Jan 4, 2022

@rapidsai/ops-codeowners If this approach is fine with the ops team, I would like to install boa into the base image's rapids conda environment so it's available by default (rather than installing it during the CI script) but I don't know where this would go. A pointer for where to make a PR would be appreciated!

@bdice bdice added the 3 - Ready for Review Ready for review by team label Jan 4, 2022
@ajschmidt8
Copy link
Member

@rapidsai/ops-codeowners If this approach is fine with the ops team, I would like to install boa into the base image's rapids conda environment so it's available by default (rather than installing it during the CI script) but I don't know where this would go. A pointer for where to make a PR would be appreciated!

Hey @bdice. We had tried this before, but it seemed to have totally borked CI (see reversion PR below). It seems I didn't document what the exact issues were that it caused. Let me see if I can dig up those details.

@bdice
Copy link
Contributor Author

bdice commented Jan 6, 2022

@ajschmidt8 That's too bad about the gpuCI image. Would this PR's current implementation be acceptable, installing boa in the CI script? It seems to be passing fine (aside from the current issue with Java). The CI build time reduction would be quite nice.

@jjacobelli
Copy link
Contributor

@rapidsai/ops-codeowners If this approach is fine with the ops team, I would like to install boa into the base image's rapids conda environment so it's available by default (rather than installing it during the CI script) but I don't know where this would go. A pointer for where to make a PR would be appreciated!

Hey @bdice. We had tried this before, but it seemed to have totally borked CI (see reversion PR below). It seems I didn't document what the exact issues were that it caused. Let me see if I can dig up those details.

* [Revert "Install boa to gpuci/rapidsai images" gpuci-build-environment#223](https://github.com/rapidsai/gpuci-build-environment/pull/223)

If I remember well, mambabuild was "reseting" the CC and CXX variable leading to using the wrong gcc version while building the libs

@ajschmidt8 ajschmidt8 marked this pull request as draft January 10, 2022 20:03
@ajschmidt8
Copy link
Member

(switched to a draft PR for now)

@bdice
Copy link
Contributor Author

bdice commented Jan 11, 2022

If I remember well, mambabuild was "reseting" the CC and CXX variable leading to using the wrong gcc version while building the libs

@robertmaynard confirmed during a call today that the issue was with overriding/ignoring CC and CXX. I am hoping to collect a little more info so that we can file a bug with boa if appropriate. It sounded like this affected cugraph/cuml more than cudf. (I can't remember everything that was said and wasn't fast enough to take notes -- @robertmaynard please feel free to fill in anything else that would help to identify the problem.)

@robertmaynard
Copy link
Contributor

If I remember well, mambabuild was "reseting" the CC and CXX variable leading to using the wrong gcc version while building the libs

@robertmaynard confirmed during a call today that the issue was with overriding/ignoring CC and CXX. I am hoping to collect a little more info so that we can file a bug with boa if appropriate. It sounded like this affected cugraph/cuml more than cudf. (I can't remember everything that was said and wasn't fast enough to take notes -- @robertmaynard please feel free to fill in anything else that would help to identify the problem.)

Yes the issue comes up when the target project consumes some pre-built C++ libraries such as faiss? where C++ standard types such as std::string are part of the ABI. In those cases the pre-built libraries and ours have different C++ ABI modes ( pre/post C++11 ) and result in a link failure ( https://developers.redhat.com/blog/2020/10/08/migrating-c-and-c-applications-from-red-hat-enterprise-linux-version-7-to-version-8#migrating_shared_libraries )

@jjacobelli
Copy link
Contributor

rerun tests

1 similar comment
@jjacobelli
Copy link
Contributor

rerun tests

@ajschmidt8
Copy link
Member

If I remember well, mambabuild was "reseting" the CC and CXX variable leading to using the wrong gcc version while building the libs

@robertmaynard confirmed during a call today that the issue was with overriding/ignoring CC and CXX. I am hoping to collect a little more info so that we can file a bug with boa if appropriate. It sounded like this affected cugraph/cuml more than cudf. (I can't remember everything that was said and wasn't fast enough to take notes -- @robertmaynard please feel free to fill in anything else that would help to identify the problem.)

@bdice, is this issue still applicable?

@bdice
Copy link
Contributor Author

bdice commented Feb 2, 2022

Yes the issue comes up when the target project consumes some pre-built C++ libraries such as faiss? where C++ standard types such as std::string are part of the ABI. In those cases the pre-built libraries and ours have different C++ ABI modes ( pre/post C++11 ) and result in a link failure ( https://developers.redhat.com/blog/2020/10/08/migrating-c-and-c-applications-from-red-hat-enterprise-linux-version-7-to-version-8#migrating_shared_libraries )

@ajschmidt8 In my understanding, we need to make sure that the correct compilers are being used (hopefully this is shown in the CI logs), to avoid issues like the one @robertmaynard mentioned above. I'm not sure if that's a sufficient check or if some kind of deeper inspection is required.

@bdice
Copy link
Contributor Author

bdice commented Feb 2, 2022

23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CC' is being passed through with value '/usr/local/gcc9/bin/gcc'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CXX' is being passed through with value '/usr/local/gcc9/bin/g++'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
...
23:24:30 -- Check for working C compiler: /usr/local/gcc9/bin/gcc - skipped
...
23:24:30 -- Check for working CXX compiler: /usr/local/gcc9/bin/g++ - skipped
...

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/7430/consoleFull

Is this enough information to know if we're getting the right compilers?

@bdice bdice marked this pull request as ready for review February 2, 2022 22:49
@ajschmidt8 ajschmidt8 marked this pull request as draft February 3, 2022 14:05
@ajschmidt8
Copy link
Member

Is this enough information to know if we're getting the right compilers?

Hmm. Regardless of the compiler question, I think @Ethyling experienced some issues again after adding boa to our CI images. He'll continue to debug. In the meantime, I switched this back to draft.

@bdice
Copy link
Contributor Author

bdice commented Feb 3, 2022

Is this enough information to know if we're getting the right compilers?

Hmm. Regardless of the compiler question, I think @Ethyling experienced some issues again after adding boa to our CI images. He'll continue to debug. In the meantime, I switched this back to draft.

It’s also possible that some behavior differs in this PR from installing boa in the base image, based on the timing of installation.

For instance, if boa introduces activation scripts, those wouldn’t be activated by installing in an active environment as I did in this PR, but the activation scripts would be run by installing in the base image and later activating the environment. (Mamba cannot hook in to launching activation scripts on install, while conda does. This was an issue I faced with Java CI in #9983). Doesn’t look like there are activation scripts here, so I’d say this is not a likely theory unless there are some other fancy hooks into conda-build’s plugin system: https://github.com/conda-forge/boa-feedstock

@jjacobelli
Copy link
Contributor

Is this enough information to know if we're getting the right compilers?

Hmm. Regardless of the compiler question, I think @Ethyling experienced some issues again after adding boa to our CI images. He'll continue to debug. In the meantime, I switched this back to draft.

It’s also possible that some behavior differs in this PR from installing boa in the base image, based on the timing of installation.

For instance, if boa introduces activation scripts, those wouldn’t be activated by installing in an active environment as I did in this PR, but the activation scripts would be run by installing in the base image and later activating the environment. (Mamba cannot hook in to launching activation scripts on install, while conda does. This was an issue I faced with Java CI in #9983).

That's really helping to know this. We clearly need to double check this

@robertmaynard
Copy link
Contributor

23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CC' is being passed through with value '/usr/local/gcc9/bin/gcc'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CXX' is being passed through with value '/usr/local/gcc9/bin/g++'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
...
23:24:30 -- Check for working C compiler: /usr/local/gcc9/bin/gcc - skipped
...
23:24:30 -- Check for working CXX compiler: /usr/local/gcc9/bin/g++ - skipped
...

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/7430/consoleFull

Is this enough information to know if we're getting the right compilers?

No it isn't enough information. The specific problem we had was that while we used the /usr/local/gcc9 compiler it would still pick up the system libstdc++ and glibc instead of the packaged version in /usr/local/gcc9/. This caused the binaries we built to use the old C++ ABI instead of the new C++ ABI ( https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html ) while the conda provided libraries used the new C++ ABI.

So to correctly verify this works we need to build a C++ project that calls a conda provided library that uses std::string and std::list as part of the interface. Specifically we need to use function/methods of that library that require us to pass the above types so the linker doesn't drop unused functions.

@jjacobelli
Copy link
Contributor

jjacobelli commented Feb 3, 2022

23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CC' is being passed through with value '/usr/local/gcc9/bin/gcc'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CXX' is being passed through with value '/usr/local/gcc9/bin/g++'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
...
23:24:30 -- Check for working C compiler: /usr/local/gcc9/bin/gcc - skipped
...
23:24:30 -- Check for working CXX compiler: /usr/local/gcc9/bin/g++ - skipped
...

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/7430/consoleFull
Is this enough information to know if we're getting the right compilers?

No it isn't enough information. The specific problem we had was that while we used the /usr/local/gcc9 compiler it would still pick up the system libstdc++ and glibc instead of the packaged version in /usr/local/gcc9/. This caused the binaries we built to use the old C++ ABI instead of the new C++ ABI ( https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html ) while the conda provided libraries used the new C++ ABI.

So to correctly verify this works we need to build a C++ project that calls a conda provided library that uses std::string and std::list as part of the interface. Specifically we need to use function/methods of that library that require us to pass the above types so the linker doesn't drop unused functions.

@robertmaynard Do you think that moving to conda compilers could solve this issue?

@robertmaynard
Copy link
Contributor

23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CC' is being passed through with value '/usr/local/gcc9/bin/gcc'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
23:16:54 /opt/conda/envs/rapids/lib/python3.7/site-packages/conda_build/environ.py:449: UserWarning: The environment variable 'CXX' is being passed through with value '/usr/local/gcc9/bin/g++'.  If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
23:16:54   UserWarning
...
23:24:30 -- Check for working C compiler: /usr/local/gcc9/bin/gcc - skipped
...
23:24:30 -- Check for working CXX compiler: /usr/local/gcc9/bin/g++ - skipped
...

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/7430/consoleFull
Is this enough information to know if we're getting the right compilers?

No it isn't enough information. The specific problem we had was that while we used the /usr/local/gcc9 compiler it would still pick up the system libstdc++ and glibc instead of the packaged version in /usr/local/gcc9/. This caused the binaries we built to use the old C++ ABI instead of the new C++ ABI ( https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html ) while the conda provided libraries used the new C++ ABI.
So to correctly verify this works we need to build a C++ project that calls a conda provided library that uses std::string and std::list as part of the interface. Specifically we need to use function/methods of that library that require us to pass the above types so the linker doesn't drop unused functions.

@robertmaynard Do you think that moving to conda compilers could solve this issue?

I think we would need to test them to see if it works as expected. IIRC we generally build outside of an active conda build environment and that could influence the behavior of the compilers

@jjacobelli jjacobelli marked this pull request as ready for review February 9, 2022 15:43
@jjacobelli jjacobelli marked this pull request as draft February 9, 2022 16:24
@github-actions
Copy link

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@bdice bdice changed the base branch from branch-22.04 to branch-22.06 March 18, 2022 18:21
@bdice bdice changed the base branch from branch-22.06 to branch-22.08 May 23, 2022 14:06
@bdice
Copy link
Contributor Author

bdice commented May 23, 2022

Closing in favor of #10911.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants