-
Notifications
You must be signed in to change notification settings - Fork 117
[test] Add MCH alltoallv check
#347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I get this error message from the performance check debug: Alltoallv_ on kesch:cn using PrgEnv-gnu: caught socket.gaierror: [Errno -2] Name or service not known |
|
After the merge of #352, the tests now complete successfully on kesch. I do not know if they are relevant to daint/dom since the module |
|
This PR is not ready to merge. I have several comments. I will review it soon. |
cscs-checks/mch/alltoallv.py
Outdated
|
|
||
| def setup(self, partition, environ, **job_opts): | ||
| if environ.name.startswith('PrgEnv-cray'): | ||
| environ.fflags = '-O2 -hacc -hnoomp' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the environment flags is now deprecated. You should use the build systems for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cscs-checks/mch/alltoallv.py
Outdated
| self.num_gpus_per_node = 16 | ||
| self.num_tasks_per_node = 16 | ||
| self.num_tasks_per_socket = 8 | ||
| self.executable = 'src/comm_overlap_benchmark %s' % exec_parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use self.executable_opts to pass the parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cscs-checks/mch/alltoallv.py
Outdated
| self.num_tasks_per_node = 16 | ||
| self.num_tasks_per_socket = 8 | ||
| self.executable = 'src/comm_overlap_benchmark %s' % exec_parameter | ||
| self.sourcesdir = ('https://github.com/cosunae/comm_overlap_bench') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the parentheses here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cscs-checks/mch/alltoallv.py
Outdated
| 'kesch:cn': { | ||
| 'perf': (5.62155, None, 0.15) | ||
| }, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is too much code duplication in the above if/else. Rewrite it as follows:
if not exec_parameter:
ref = 5.53777
elif exec_parameter == '--nocomm':
ref = 5.7878
elif exec_parameter == '--nocomp':
ref = 5.62155
self.reference = {
'kesch:cn': {
'perf': (ref, None, 0.15)
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cscs-checks/mch/alltoallv.py
Outdated
| self.sourcepath = 'src' | ||
| self.sanity_patterns = sn.assert_found(r'ELAPSED TIME:', self.stdout) | ||
| self.perf_patterns = { | ||
| 'perf': sn.extractsingle(r'ELAPSED TIME:\s+(?P<perf>\S+)', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please name the performance variable with a more meaningful name. What this metric is? Latency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elapsed_time
cscs-checks/mch/alltoallv.py
Outdated
| }, | ||
| } | ||
|
|
||
| self.modules += ['craype-haswell', 'craype-network-infiniband', 'mvapich2gdr_gnu/2.2_cuda_8.0', 'cray-libsci_acc/17.03.1', 'cmake'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wrap this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cscs-checks/mch/alltoallv.py
Outdated
| -DCUDA_COMPUTE_CAPABILITY="sm_37" \ | ||
| -DCMAKE_BUILD_TYPE=Release \ | ||
| -DENABLE_MPI_TIMER=ON' | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should be revised as soon as #299 is fixed.
config/cscs.py
Outdated
| '_rfm_gpu': ['--gres=gpu:{num_gpus_per_node}'] | ||
| '_rfm_gpu': ['--gres=gpu:{num_gpus_per_node}'], | ||
| 'distribution': ['--distribution=block:block'], | ||
| 'cpu_bind' : ['--cpu_bind=q'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resource names must be more generic and allow tests to pass options:
'resources': {
'task_placement': ['--distribution={distribution}', '--cpu_bind={cpu_binding}']
}Then you should define the extra_resources as follows:
self.extra_resources = {
'task_placement': {
'distribution': 'block:block',
'cpu_binding': 'q'
}But I don't think that the --cpu_bind option is needed: q is the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this comment is outdated?
…cks/mch_comm_overlap_bench_barebones
alltoallv check
vkarak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have several comments. It seems you are mixing parts that are relevant for different systems (Daint/Dom and Kesch) and different MPI versions. Have you tested this on Daint/Dom? I'm sure it would have failed. I suggest focusing only on Kesch, remove Daint/Dom from the supported systems list, as well as all relevant bits from the building and running. Unless it's straightforward to support Daint/Dom.
config/cscs.py
Outdated
| '_rfm_gpu': ['--gres=gpu:{num_gpus_per_node}'] | ||
| '_rfm_gpu': ['--gres=gpu:{num_gpus_per_node}'], | ||
| 'distribution': ['--distribution=block:block'], | ||
| 'cpu_bind' : ['--cpu_bind=q'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this comment is outdated?
cscs-checks/mch/alltoallv.py
Outdated
| import reframe.utility.sanity as sn | ||
|
|
||
|
|
||
| @rfm.parameterized_test([''], ['--nocomm'], ['--nocomp']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's handy, but I wouldn't pass the option here. I would rename the exec_parameter to variant and pass three distinct values: default, nocomm, nocomp. I would then set up the executable options accordingly. The way it is now will produce ugly test names, e.g., Alltoallv___nocomm.
cscs-checks/mch/alltoallv.py
Outdated
| if self.current_system.name in ['daint', 'dom']: | ||
| self.modules = ['craype-accel-nvidia60'] | ||
| self._pgi_flags = ['-acc', '-ta=tesla:cc60', '-Mnorpath'] | ||
| elif self.current_system.name in ['kesch']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
== 'kesch'
cscs-checks/mch/alltoallv.py
Outdated
| if environ.name.startswith('PrgEnv-cray'): | ||
| self.build_system.fflags = ['-O2', '-hacc', '-hnoomp'] | ||
| elif environ.name.startswith('PrgEnv-pgi'): | ||
| self.build_system.fflags = [self._pgi_flags] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how this is allowed. It should have been:
self.build_system.fflags = self._pgi_flagsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this simply works because we are not running this test for the PrgEnv-pgi.
cscs-checks/mch/alltoallv.py
Outdated
| self.num_tasks_per_node = 16 | ||
| self.num_tasks_per_socket = 8 | ||
| self.executable = ('build/src/comm_overlap_benchmark ' | ||
| '%s' % exec_parameter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exec_parameter should be part of the self.executable_opts list.
cscs-checks/mch/alltoallv.py
Outdated
|
|
||
| self.modules += [ | ||
| 'craype-haswell', 'craype-network-infiniband', | ||
| 'mvapich2gdr_gnu/2.2_cuda_8.0', 'cray-libsci_acc/17.03.1', 'cmake' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mvapich2gdr_gnu/2.2_cuda_8.0 is supposed to be loaded already by ReFrame's PrgEnv-cray. Also verify that the other modules are still needed.
cscs-checks/mch/alltoallv.py
Outdated
|
|
||
| self.variables = { | ||
| 'G2G': '1', | ||
| 'jobs': '144', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not to be used.
cscs-checks/mch/alltoallv.py
Outdated
|
|
||
| self.pre_run = [ | ||
| 'export BOOST_LIBRARY_PATH=/apps/escha/UES/PrgEnv-gnu-17.02'\ | ||
| '/modulefiles/boost/1.63.0-gmvolf-17.02-python-2.7.13/lib', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not split the lines with \. I would suggest to put each path in a single long line, even if it is beyond 80 columns.
cscs-checks/mch/alltoallv.py
Outdated
| 'export XXX_LIBRARY_PATH=/apps/escha/UES/RH7.3_experimental/pgi'\ | ||
| '/17.10/linux86-64/17.10/REDIST', | ||
| 'export LD_LIBRARY_PATH=$XXX_LIBRARY_PATH:$LD_LIBRARY_PATH', | ||
| 'export LD_PRELOAD=/opt/mvapich2/gdr/2.3a/mcast/no-openacc'\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to contradict with the mvapich2gdr_gnu/2.2_cuda_8.0 loaded above.
cscs-checks/mch/alltoallv.py
Outdated
| } | ||
|
|
||
| self.pre_run = [ | ||
| 'export BOOST_LIBRARY_PATH=/apps/escha/UES/PrgEnv-gnu-17.02'\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have a module for Boost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed on Kesch or just for Daint/Dom.
|
@ajocksch Indeed it does not work on Dom: |
|
@ajocksch I have fixed and adapted the test for both Kesch and Dom/Daint. It turned out to be much easier. Their build scripts seem quite outdated. Please have a look so that we can merge by today. |
cscs-checks/mch/alltoallv.py
Outdated
| self.num_tasks_per_node = 1 | ||
| self.modules = ['craype-accel-nvidia60', 'CMake'] | ||
| self.variables['MPICH_RDMA_ENABLED_CUDA'] = '1' | ||
| self.build_system.config_opts += ['-DMPI_VENDOR=mvapich2', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct, I should adapt the CMake options for the test on Daint/Dom.
|
@jenkins-cscs retry dom kesch |
|
@jenkins-cscs retry kesch |
Closes #253