@ajocksch
Contributor

@ajocksch ajocksch commented Jul 5, 2018

Closes #253

@ajocksch ajocksch requested a review from teojgo July 5, 2018 10:02
@ajocksch
Contributor Author

ajocksch commented Jul 5, 2018

I get this error message from the performance check:

debug: Alltoallv_ on kesch:cn using PrgEnv-gnu: caught socket.gaierror: [Errno -2] Name or service not known

@vkarak vkarak self-requested a review July 5, 2018 10:08
@vkarak vkarak added this to the ReFrame sprint 2018w26 milestone Jul 5, 2018
@vkarak vkarak changed the title WIP: alltoallv check; socket.gaierror [WIP] alltoallv check; socket.gaierror Jul 5, 2018
@teojgo
Contributor

teojgo commented Jul 6, 2018

@ajocksch The error you get is the one from issue #349.

@vkarak
Contributor

vkarak commented Jul 10, 2018

@ajocksch Can you be a bit more precise in the title of your PR? Also, the socket error is no longer relevant; it was fixed in #352.

@teojgo
Contributor

teojgo commented Jul 11, 2018

After the merge of #352, the tests now complete successfully on kesch. I do not know if they are relevant to daint/dom since the module craype-network-infiniband is not available there.

@vkarak
Contributor

vkarak commented Jul 11, 2018

This PR is not ready to merge. I have several comments. I will review it soon.


def setup(self, partition, environ, **job_opts):
    if environ.name.startswith('PrgEnv-cray'):
        environ.fflags = '-O2 -hacc -hnoomp'
Contributor

Setting the environment flags is now deprecated. You should use the build systems for that.

Contributor Author

done

self.num_gpus_per_node = 16
self.num_tasks_per_node = 16
self.num_tasks_per_socket = 8
self.executable = 'src/comm_overlap_benchmark %s' % exec_parameter
Contributor

Use self.executable_opts to pass the parameters.

Contributor Author

done

self.num_tasks_per_node = 16
self.num_tasks_per_socket = 8
self.executable = 'src/comm_overlap_benchmark %s' % exec_parameter
self.sourcesdir = ('https://github.com/cosunae/comm_overlap_bench')
Contributor

Please remove the parentheses here.

Contributor Author

done

    'kesch:cn': {
        'perf': (5.62155, None, 0.15)
    },
}
Contributor

There is too much code duplication in the above if/else. Rewrite it as follows:

if not exec_parameter:
    ref = 5.53777
elif exec_parameter == '--nocomm':
    ref = 5.7878
elif exec_parameter == '--nocomp':
    ref = 5.62155

self.reference = {
    'kesch:cn': {
        'perf': (ref, None, 0.15)
    }
}

Contributor Author

done

self.sourcepath = 'src'
self.sanity_patterns = sn.assert_found(r'ELAPSED TIME:', self.stdout)
self.perf_patterns = {
'perf': sn.extractsingle(r'ELAPSED TIME:\s+(?P<perf>\S+)',
Contributor

Please give the performance variable a more meaningful name. What is this metric? Latency?

Contributor Author

elapsed_time
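
For illustration, here is a standalone sketch of what the renamed metric could look like outside ReFrame. It reuses the regular expression from the diff with the capture group renamed to `elapsed_time`. The helper names, the sample output line, and the reading of the reference tuple as (value, lower threshold, upper threshold) with fractional thresholds are assumptions made for this sketch, not something confirmed in this thread.

```python
import re

# Same pattern as in the diff, with the group renamed to the new metric name.
ELAPSED_RE = re.compile(r'ELAPSED TIME:\s+(?P<elapsed_time>\S+)')


def extract_elapsed_time(output):
    """Return the elapsed time reported in the benchmark output."""
    match = ELAPSED_RE.search(output)
    if match is None:
        raise ValueError('no ELAPSED TIME line found')
    return float(match.group('elapsed_time'))


def within_reference(value, reference):
    """Check a value against (ref, lower, upper); None means no bound.

    Assumption: thresholds are fractions of the reference value, with a
    negative fraction for the lower bound.
    """
    ref, lower, upper = reference
    if lower is not None and value < ref * (1 + lower):
        return False
    if upper is not None and value > ref * (1 + upper):
        return False
    return True


# Hypothetical sample output; the reference tuple is the one from the diff.
sample_output = 'ELAPSED TIME: 5.60\n'
elapsed_time = extract_elapsed_time(sample_output)
is_ok = within_reference(elapsed_time, (5.53777, None, 0.15))
```

Under that reading, only runs more than 15% slower than the reference would be flagged, since the lower bound is `None`.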

},
}

self.modules += ['craype-haswell', 'craype-network-infiniband', 'mvapich2gdr_gnu/2.2_cuda_8.0', 'cray-libsci_acc/17.03.1', 'cmake']
Contributor

Please wrap this line.

Contributor Author

done

-DCUDA_COMPUTE_CAPABILITY="sm_37" \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_MPI_TIMER=ON'
]
Contributor

This test should be revised as soon as #299 is fixed.

config/cscs.py Outdated
'_rfm_gpu': ['--gres=gpu:{num_gpus_per_node}']
'_rfm_gpu': ['--gres=gpu:{num_gpus_per_node}'],
'distribution': ['--distribution=block:block'],
'cpu_bind' : ['--cpu_bind=q']
Contributor

The resource names must be more generic and allow tests to pass options:

'resources': {
    'task_placement': ['--distribution={distribution}', '--cpu_bind={cpu_binding}']
}

Then you should define the extra_resources as follows:

self.extra_resources = {
    'task_placement': {
        'distribution': 'block:block',
        'cpu_binding': 'q'
    }
}

But I don't think that the --cpu_bind option is needed: q is the default value.
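
To make the idea of templated resources concrete, here is a small standalone sketch of how the placeholders could be filled in. It is not ReFrame's implementation: `expand_resources` is a hypothetical helper that substitutes the values a test supplies using `str.format`.

```python
def expand_resources(resources, extra_resources):
    """Fill each option template with the values requested by a test."""
    options = []
    for name, values in extra_resources.items():
        # Resources that the configuration does not define are skipped.
        for template in resources.get(name, []):
            options.append(template.format(**values))
    return options


# Generic resource definition, as it would appear in the site configuration.
resources = {
    'task_placement': ['--distribution={distribution}',
                       '--cpu_bind={cpu_binding}']
}

# Values requested by the test.
extra_resources = {
    'task_placement': {
        'distribution': 'block:block',
        'cpu_binding': 'q'
    }
}

scheduler_options = expand_resources(resources, extra_resources)
```

With this split, the configuration stays generic and each test decides which placement values to pass.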

Contributor Author

this comment is outdated

Contributor

Why is this comment outdated?

@vkarak vkarak removed this from the ReFrame sprint 2018w35 milestone Sep 4, 2018
@ajocksch ajocksch changed the title [WIP] alltoallv check; socket.gaierror alltoallv check; socket.gaierror Oct 4, 2018
@vkarak vkarak changed the title alltoallv check; socket.gaierror [test] Add MCH alltoallv check Oct 4, 2018
@vkarak vkarak added this to the Upcoming sprint milestone Oct 9, 2018
Contributor

@vkarak vkarak left a comment

I have several comments. It seems you are mixing parts that are relevant to different systems (Daint/Dom and Kesch) and different MPI versions. Have you tested this on Daint/Dom? I'm sure it would have failed. Unless it's straightforward to support Daint/Dom, I suggest focusing only on Kesch and removing Daint/Dom from the supported systems list, along with all the related bits in the build and run steps.

import reframe.utility.sanity as sn


@rfm.parameterized_test([''], ['--nocomm'], ['--nocomp'])
Contributor

I know it's handy, but I wouldn't pass the option here. I would rename the exec_parameter to variant and pass three distinct values: default, nocomm, nocomp. I would then set up the executable options accordingly. The way it is now will produce ugly test names, e.g., Alltoallv___nocomm.
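
A minimal sketch of the variant-based setup suggested here, kept free of ReFrame imports so that it runs on its own. The names `VARIANT_OPTS`, `REFERENCE_TIMES` and `configure` are hypothetical. The reference values and the 0.15 upper threshold are the ones quoted earlier in this review, and `elapsed_time` is the metric name proposed there.

```python
# Hypothetical mapping from a readable variant name to the benchmark's
# command-line options, so the option itself never ends up in a test name.
VARIANT_OPTS = {
    'default': [],
    'nocomm': ['--nocomm'],
    'nocomp': ['--nocomp'],
}

# Reference elapsed times per variant, taken from the earlier review comment.
REFERENCE_TIMES = {
    'default': 5.53777,
    'nocomm': 5.7878,
    'nocomp': 5.62155,
}


def configure(variant):
    """Return the settings a test could apply for the given variant."""
    if variant not in VARIANT_OPTS:
        raise ValueError('unknown variant: %s' % variant)

    return {
        'executable': 'build/src/comm_overlap_benchmark',
        'executable_opts': list(VARIANT_OPTS[variant]),
        'reference': {
            'kesch:cn': {
                'elapsed_time': (REFERENCE_TIMES[variant], None, 0.15)
            }
        },
    }


settings = configure('nocomm')
```

Parameterizing the test over `'default'`, `'nocomm'` and `'nocomp'` would then keep the options in `executable_opts` and out of the executable string.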

if self.current_system.name in ['daint', 'dom']:
    self.modules = ['craype-accel-nvidia60']
    self._pgi_flags = ['-acc', '-ta=tesla:cc60', '-Mnorpath']
elif self.current_system.name in ['kesch']:
Contributor

== 'kesch'

if environ.name.startswith('PrgEnv-cray'):
    self.build_system.fflags = ['-O2', '-hacc', '-hnoomp']
elif environ.name.startswith('PrgEnv-pgi'):
    self.build_system.fflags = [self._pgi_flags]
Contributor

@vkarak vkarak Oct 10, 2018

I wonder how this is allowed. It should have been:

self.build_system.fflags = self._pgi_flags

Contributor

Ok, this simply works because we are not running this test for the PrgEnv-pgi.
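
A standalone illustration of why the extra brackets are a latent bug: wrapping an existing list in `[...]` yields a nested list, not a flat list of flag strings. The `render_flags` helper is hypothetical, and the assumption that the flags are eventually joined with spaces is made only for this sketch.

```python
pgi_flags = ['-acc', '-ta=tesla:cc60', '-Mnorpath']


def render_flags(flags):
    """Join a flat list of flag strings into one command-line fragment."""
    return ' '.join(flags)


# Passing the list itself works as intended.
flat = render_flags(pgi_flags)

# Wrapping it in another list fails as soon as the flags are joined,
# because the single element is a list, not a string.
try:
    render_flags([pgi_flags])
    nested_error = None
except TypeError as err:
    nested_error = err
```

Since the failure only shows up once the flags are consumed, a branch that never runs (here, PrgEnv-pgi) would hide it.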

self.num_tasks_per_node = 16
self.num_tasks_per_socket = 8
self.executable = ('build/src/comm_overlap_benchmark '
'%s' % exec_parameter)
Contributor

The exec_parameter should be part of the self.executable_opts list.


self.modules += [
'craype-haswell', 'craype-network-infiniband',
'mvapich2gdr_gnu/2.2_cuda_8.0', 'cray-libsci_acc/17.03.1', 'cmake'
Contributor

mvapich2gdr_gnu/2.2_cuda_8.0 is supposed to be loaded already by ReFrame's PrgEnv-cray. Also verify that the other modules are still needed.


self.variables = {
'G2G': '1',
'jobs': '144',
Contributor

This seems not to be used.


self.pre_run = [
'export BOOST_LIBRARY_PATH=/apps/escha/UES/PrgEnv-gnu-17.02'\
'/modulefiles/boost/1.63.0-gmvolf-17.02-python-2.7.13/lib',
Contributor

Do not split the lines with \. I suggest putting each path on a single long line, even if it goes beyond 80 columns.
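
As a sketch of that suggestion, the Boost path from the diff can be written on one line and reused in the export command. The variable name `BOOST_LIB` is hypothetical; the path is the one from the diff with its two halves joined.

```python
# Keep the whole path on one line so it can be searched for and copied
# as-is, even though it exceeds 80 columns.
BOOST_LIB = '/apps/escha/UES/PrgEnv-gnu-17.02/modulefiles/boost/1.63.0-gmvolf-17.02-python-2.7.13/lib'

pre_run = [
    'export BOOST_LIBRARY_PATH=%s' % BOOST_LIB,
]

# For comparison: adjacent string literals inside parentheses are already
# concatenated by Python, so the trailing backslash was never required.
split_form = ('/apps/escha/UES/PrgEnv-gnu-17.02'
              '/modulefiles/boost/1.63.0-gmvolf-17.02-python-2.7.13/lib')
```

Both forms produce the same string; the single-line form is simply easier to grep.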

'export XXX_LIBRARY_PATH=/apps/escha/UES/RH7.3_experimental/pgi'\
'/17.10/linux86-64/17.10/REDIST',
'export LD_LIBRARY_PATH=$XXX_LIBRARY_PATH:$LD_LIBRARY_PATH',
'export LD_PRELOAD=/opt/mvapich2/gdr/2.3a/mcast/no-openacc'\
Contributor

This seems to contradict the mvapich2gdr_gnu/2.2_cuda_8.0 module loaded above.

}

self.pre_run = [
'export BOOST_LIBRARY_PATH=/apps/escha/UES/PrgEnv-gnu-17.02'\
Contributor

Don't we have a module for Boost?

Contributor

Is this needed on Kesch or just for Daint/Dom?

@vkarak
Contributor

vkarak commented Oct 10, 2018

@ajocksch Indeed it does not work on Dom:

[==========] Running 3 check(s)
[==========] Started on Wed Oct 10 16:56:49 2018

[----------] started processing Alltoallv_ (Alltoallv_)
[     FAIL ] Alltoallv_ on dom:gpu using PrgEnv-gnu
[----------] finished processing Alltoallv_ (Alltoallv_)

[----------] started processing Alltoallv___nocomm (Alltoallv___nocomm)
[     FAIL ] Alltoallv___nocomm on dom:gpu using PrgEnv-gnu
[----------] finished processing Alltoallv___nocomm (Alltoallv___nocomm)

[----------] started processing Alltoallv___nocomp (Alltoallv___nocomp)
[     FAIL ] Alltoallv___nocomp on dom:gpu using PrgEnv-gnu
[----------] finished processing Alltoallv___nocomp (Alltoallv___nocomp)

[----------] waiting for spawned checks to finish
[----------] all spawned checks have finished

[  FAILED  ] Ran 3 test case(s) from 3 check(s) (3 failure(s))
[==========] Finished on Wed Oct 10 16:56:51 2018

==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for Alltoallv_
  * System partition: dom:gpu
  * Environment: PrgEnv-gnu
  * Stage directory: None
  * Job type: batch job (id=-1)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: setup
  * Reason: caught framework exception: could not load module craype-network-infiniband
------------------------------------------------------------------------------
FAILURE INFO for Alltoallv___nocomm
  * System partition: dom:gpu
  * Environment: PrgEnv-gnu
  * Stage directory: None
  * Job type: batch job (id=-1)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: setup
  * Reason: caught framework exception: could not load module craype-network-infiniband
------------------------------------------------------------------------------
FAILURE INFO for Alltoallv___nocomp
  * System partition: dom:gpu
  * Environment: PrgEnv-gnu
  * Stage directory: None
  * Job type: batch job (id=-1)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: setup
  * Reason: caught framework exception: could not load module craype-network-infiniband
------------------------------------------------------------------------------

@vkarak vkarak self-assigned this Oct 10, 2018
@vkarak
Contributor

vkarak commented Oct 11, 2018

@ajocksch I have fixed and adapted the test for both Kesch and Dom/Daint. It turned out to be much easier. Their build scripts seem quite outdated. Please have a look so that we can merge by today.

self.num_tasks_per_node = 1
self.modules = ['craype-accel-nvidia60', 'CMake']
self.variables['MPICH_RDMA_ENABLED_CUDA'] = '1'
self.build_system.config_opts += ['-DMPI_VENDOR=mvapich2',
Contributor

@vkarak vkarak Oct 11, 2018

This is not correct; I should adapt the CMake options for the test on Daint/Dom.

@vkarak
Contributor

vkarak commented Oct 11, 2018

@jenkins-cscs retry dom kesch

@vkarak
Contributor

vkarak commented Oct 11, 2018

@jenkins-cscs retry kesch

@vkarak vkarak merged commit e0d608a into master Oct 11, 2018
@vkarak vkarak deleted the checks/mch_comm_overlap_bench_barebones branch October 11, 2018 11:37

4 participants