@vkarak vkarak commented May 29, 2021

This integrates @ekouts' CPU auto-detection work into the framework. ReFrame will try to auto-detect the system's topology the first time it runs on it and will reuse that in subsequent runs on the same system. More specifically:

  • If the processor info is specified in the configuration file, no auto-detection is attempted.
  • If the processor info is not specified in the configuration file, ReFrame looks for processor info metadata files (which it might have produced in a previous run) and loads them. These metadata files are looked up in {configdir}/_meta/{system}-{part}/processor.json, or in ~/.reframe/topology/{system}-{part}/processor.json when the builtin configuration file is used. If the file is found, the topology information is loaded from there.
  • If the topology file is not found, the topology will be auto-detected. If the system partition is local (i.e., local scheduler + local launcher), the topology is auto-detected unconditionally. If the partition is remote, ReFrame will not try to auto-detect it unless the RFM_REMOTE_DETECT environment variable (or the corresponding configuration option) is set.
  • For detecting remote topologies, ReFrame will generate a job script based on the partition information and launch itself on the remote system with {launcher} reframe --detect-host-topology=topo.json. The --detect-host-topology option causes ReFrame to detect the topology of the current host. More specifically, ReFrame creates a temporary directory (by default under the current directory, .), copies itself there and re-bootstraps itself inside the job script, i.e., on the remote nodes. This is to account for environment differences between the local and remote hosts.
  • The temporary directory prefix can be changed by setting the RFM_REMOTE_WORKDIR environment variable.
  • In case of errors during auto-detection, ReFrame will simply issue a warning and continue.
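
The lookup order described in the list above can be sketched roughly as follows. This is only an illustration and not the code in this PR: `topology_file`, `resolve_processor_info` and their parameters are hypothetical names, the `detect` callable stands in for the actual detection step, and writing the result back to the metadata file is an assumption based on the reuse in subsequent runs.

```python
import json
import warnings
from pathlib import Path


def topology_file(configdir, system, part, builtin=False):
    """Return where the processor metadata file is expected (hypothetical helper)."""
    if builtin:
        # Builtin configuration: metadata lives under the user's home directory
        base = Path.home() / '.reframe' / 'topology'
    else:
        base = Path(configdir) / '_meta'

    return base / f'{system}-{part}' / 'processor.json'


def resolve_processor_info(configured, meta_file, is_local, remote_detect, detect):
    """Return (info, source) following the lookup order described above.

    `configured` is the processor info from the configuration file (if any),
    `remote_detect` mirrors the opt-in for remote partitions and `detect`
    is a callable that performs the actual detection (all hypothetical).
    """
    if configured:
        # 1. Explicit configuration wins; nothing else is attempted
        return configured, 'config'

    if meta_file.is_file():
        # 2. Reuse the metadata produced by a previous run
        with open(meta_file) as fp:
            return json.load(fp), 'cache'

    if not is_local and not remote_detect:
        # 3. Remote partitions are only probed when explicitly requested
        return {}, 'skipped'

    try:
        info = detect()
    except Exception as err:
        # 4. Errors are not fatal: warn and carry on without topology info
        warnings.warn(f'topology auto-detection failed: {err}')
        return {}, 'failed'

    # Assumption: the result is stored so that later runs can reuse it
    meta_file.parent.mkdir(parents=True, exist_ok=True)
    with open(meta_file, 'w') as fp:
        json.dump(info, fp)

    return info, 'detected'
```

With this shape, a second call for the same partition takes the 'cache' branch and never invokes `detect` again, which matches the reuse on subsequent runs described above.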

I tested it on Dom, Eiger and Ault, and it works beautifully.

Todos

  • Write documentation
  • Add unit tests
  • Review and fine tune the implementation
  • Implement the mechanism for loading device configuration from a file, despite the lack of device auto-detection

Future work

Auto-detection of GPU devices is left for future work.

Fixes #1742.

@vkarak vkarak added this to the ReFrame Sprint 21.05.2 milestone May 29, 2021
@vkarak vkarak requested review from ekouts, jjotero and victorusu May 29, 2021 22:00
@vkarak vkarak self-assigned this May 29, 2021
@vkarak vkarak marked this pull request as draft May 29, 2021 22:01

codecov-commenter commented May 29, 2021

Codecov Report

Merging #1991 (6e5a086) into master (8960127) will decrease coverage by 0.75%.
The diff coverage is 67.23%.

@@            Coverage Diff             @@
##           master    #1991      +/-   ##
==========================================
- Coverage   87.51%   86.75%   -0.76%     
==========================================
  Files          50       52       +2     
  Lines        8891     9235     +344     
==========================================
+ Hits         7781     8012     +231     
- Misses       1110     1223     +113     
Impacted Files                    Coverage Δ
reframe/frontend/autodetect.py    58.64% <58.64%> (ø)
reframe/utility/cpuinfo.py        69.10% <69.10%> (ø)
reframe/frontend/cli.py           76.04% <84.21%> (+0.30%) ⬆️
reframe/core/systems.py           88.75% <85.71%> (+0.40%) ⬆️
reframe/utility/osext.py          84.98% <90.90%> (+0.26%) ⬆️
reframe/core/config.py            91.03% <100.00%> (+0.07%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@vkarak vkarak requested a review from teojgo May 31, 2021 08:17
@vkarak vkarak marked this pull request as ready for review June 7, 2021 08:58

@jjotero jjotero left a comment

Just very minor comments and a bit of a corner case where this fails (see the comment below on the autodetect.py file).

pep8speaks commented Jul 1, 2021

Hello @vkarak, Thank you for updating!

Cheers! There are no PEP8 issues in this Pull Request! Do see the ReFrame Coding Style Guide.

Comment last updated at 2021-07-09 13:37:44 UTC

@jjotero jjotero left a comment

Need to update the docs to have the right name for the config options and env vars. Other than this, looks good!

@jjotero jjotero left a comment

lgtm

vkarak commented Jul 9, 2021

@jenkins-cscs retry daint pilatus

@vkarak vkarak merged commit 34ee3d0 into reframe-hpc:master Jul 9, 2021
@vkarak vkarak deleted the feat/cpu-autodetect branch July 9, 2021 20:51
Development

Successfully merging this pull request may close these issues.

Automatically configure ReFrame for the underlying system
