
Conversation

@ajocksch
Contributor

@ajocksch ajocksch commented Jun 1, 2018

Closes #254

@ajocksch ajocksch self-assigned this Jun 1, 2018
@ajocksch ajocksch requested a review from vkarak June 1, 2018 14:39
@vkarak vkarak added this to the ReFrame sprint 2018w20 milestone Jun 4, 2018
class AutomaticArraysCheck(RegressionTest):
    def __init__(self, **kwargs):
        super().__init__('automatic_arrays_check',
                         os.path.dirname(__file__), **kwargs)
Contributor

Can you use the new syntax for regression tests? This boilerplate code won't be needed any more, and neither will the _get_checks() function.
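
For illustration, a minimal sketch of what the boilerplate-free version could look like (the @rfm.simple_test decorator and the argument-less super().__init__() follow the then-new syntax; the body shown is only an assumption, with the source file name taken from the CI log later in this thread):

import reframe as rfm

@rfm.simple_test
class AutomaticArraysCheck(rfm.RegressionTest):
    def __init__(self):
        super().__init__()
        # Illustrative placeholders only
        self.valid_systems = ['kesch:cn']
        self.valid_prog_environs = ['PrgEnv-cray']
        self.sourcepath = 'automatic_arrays.f90'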

Contributor

@ajocksch You can check the tutorial examples to see the actual syntax.

Contributor Author

done

self.stdout, 'perf', float)
}

self.aarrays_reference = {
Contributor

Do you mean arrays_reference?

Contributor Author

done

self.maintainers = ['AJ', 'VK']

def setup(self, partition, environ, **job_opts):
    if 'PrgEnv-cray' in environ.name:
Contributor

Why are you using in here instead of ==?

Contributor Author

since it might be PrgEnv-cray/xxxyyy

Contributor

OK, I got that from your other PR. I think, though, it's better to check that environ.name starts with PrgEnv-cray, because what you have now also allows xxx-PrgEnv-cray.
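
i.e. something like:

# 'PrgEnv-cray' in environ.name would also match names like 'xxx-PrgEnv-cray';
# a prefix check is stricter:
if environ.name.startswith('PrgEnv-cray'):
    ...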

Contributor Author

done

environ.fflags = '-O2'

super().setup(partition, environ, **job_opts)
self.reference = self.aarrays_reference[self.current_environ.name]
Contributor

Since you are already using environ.name, you'd better move this before super().setup(...) and use environ.name here as well, for symmetry.
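
A sketch of the suggested ordering (the flag value is just the one from the diff context and its placement is illustrative; the dictionary follows the renamed arrays_reference):

def setup(self, partition, environ, **job_opts):
    if environ.name.startswith('PrgEnv-cray'):
        environ.fflags = '-O2'
    # Pick the reference before calling super().setup(), using environ.name for symmetry
    self.reference = self.arrays_reference[environ.name]
    super().setup(partition, environ, **job_opts)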

Contributor Author

done

}
}

self.maintainers = ['AJ', 'VK']
Contributor

I think you should tag this test as production, too.
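
For example (assuming the usual set-valued tags attribute):

# Add 'production' on top of whatever tags the test already defines
self.tags |= {'production'}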

Contributor Author

I can do it. However, a few checks fail and we will have "red" in the CI.

Contributor

No problem with that. We know that the programming environments are not working properly on Kesch. I will merge it as soon as the rest of the systems are "green".

Contributor Author

ok

@vkarak vkarak changed the title WIP: automatic arrays in compiler check Automatic arrays in compiler check Jun 4, 2018
@vkarak
Contributor

vkarak commented Jun 10, 2018

@jenkins-cscs retry dom

class AutomaticArraysCheck(rfm.RegressionTest):
    def __init__(self, **kwargs):
        super().__init__()
        self.valid_systems = ['daint:gpu', 'dom:gpu', 'kesch:cn']
Contributor

@ajocksch The test fails constantly on Dom and (perhaps) will do so on the updated Daint. This needs investigation. The problem is that the performance checking is hardcoded inside this test, so performance checking in ReFrame has practically no effect (apart from logging) and, besides, we cannot adjust the performance values for other systems. For this reason, I don't think this test is really portable. I see two possible solutions here:

  1. Make the test more portable by separating the sanity checking from the performance checking (a rough sketch follows below). The sanity check should make sure that no validation errors occur (see the source code). The performance numbers should be adapted for each system.
  2. Make this test available only on Kesch.
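
For illustration only, a rough sketch of option 1 with sanity and performance checking split (the regex patterns and the 'perf' tag are assumptions, not the actual test code; only the general shape matters):

import reframe.utility.sanity as sn

# Sanity: only verify that the program reported a result without validation errors
self.sanity_patterns = sn.assert_found(r'Result\s*:', self.stdout)
# Performance: extract the timing and let ReFrame compare it against self.reference
self.perf_patterns = {
    'perf': sn.extractsingle(r'Timing\s*:\s*(?P<perf>\S+)', self.stdout, 'perf', float)
}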

Contributor Author

I think @victorusu made the point in the stand-up meeting: this check should behave differently for the different versions of the compilers, and we should check for negative results where they are expected. As a result, the * wildcard does not work; one needs to specify all the programming environments separately.
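
i.e. something along these lines, listing the Kesch environments from the failure summary later in this thread explicitly instead of a PrgEnv-* wildcard:

self.valid_prog_environs = ['PrgEnv-cray_aj', 'PrgEnv-cray_aj_b',
                            'PrgEnv-pgi_16', 'PrgEnv-pgi_17',
                            'PrgEnv-pgi_18', 'PrgEnv-pgi_18_aj']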

Contributor

@ajocksch @victorusu Guys, I understand that we need to have different performance values for the different programming environments, but the test itself is not portable. It assumes a single performance value (obtained perhaps on a single system only) and prints a PASS/FAIL based on that. If we want to do proper sanity and performance checking, we should ignore the PASS/FAIL printed by the test and let ReFrame do the performance checking based on the reference we put per system. If we go in the direction of putting this test in production for Daint/Dom, we should make it more robust and fix the performance values accordingly. If not, which is also my proposal, since we want this test in ASAP, we should only allow it to run on Kesch.
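
A rough sketch of per-system references, letting ReFrame do the checking (the Kesch numbers are taken from the failure summary later in this thread; the Daint/Dom entries are placeholders to be measured):

self.reference = {
    'kesch:cn':  {'perf': (0.00014, None, 0.15)},
    'dom:gpu':   {'perf': (0.0, None, None)},   # placeholder until measured
    'daint:gpu': {'perf': (0.0, None, None)},   # placeholder until measured
}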

@codecov-io

codecov-io commented Jun 14, 2018

Codecov Report

Merging #311 into master will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #311      +/-   ##
==========================================
+ Coverage    91.3%   91.32%   +0.02%     
==========================================
  Files          68       68              
  Lines        8107     8107              
==========================================
+ Hits         7402     7404       +2     
+ Misses        705      703       -2
Impacted Files Coverage Δ
reframe/core/config.py 84.54% <0%> (+1.81%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 527907f...4d140e8.

@vkarak
Contributor

vkarak commented Jun 14, 2018

@ajocksch The test fails due to KeyError in the arrays_reference. Have a look here:

https://jenkins.cscs.ch/blue/organizations/jenkins/ReframeCI/detail/ReframeCI/588/pipeline

@ajocksch
Contributor Author

The * in PrgEnv is the problem, since the dictionary keys are only the PrgEnv names without the *.

One solution: run the check for one PrgEnv only.

Another solution: extend the dictionaries, or somehow allow the * in the lookups.

@vkarak
Contributor

vkarak commented Jun 14, 2018

You can also do the following:

if environ.name.startswith('PrgEnv-pgi'):
    key = 'PrgEnv-pgi'
else:
    key = environ.name

self.reference = self.arrays_reference[key]

@ajocksch
Contributor Author

PrgEnv-pgi_16, PrgEnv-pgi_17 and PrgEnv-pgi_18 fail as expected.

PrgEnv-cray_aj* also fails, since the mvapich* libraries set the -I and -L paths only for mpif90 and not for ftn; this needs to be discussed.

@vkarak
Contributor

vkarak commented Jun 18, 2018

@jenkins-cscs retry all

@vkarak
Contributor

vkarak commented Jun 18, 2018

@ajocksch This is the output I am getting:

SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for AutomaticArraysCheck
  * System partition: kesch:cn
  * Environment: PrgEnv-pgi_16
  * Stage directory: /users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_16
  * Job type: batch job (id=None)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: compile
  * Reason: OS error: [Errno 2] No such file or directory: 'mpif90': 'mpif90'
------------------------------------------------------------------------------
FAILURE INFO for AutomaticArraysCheck
  * System partition: kesch:cn
  * Environment: PrgEnv-pgi_17
  * Stage directory: /users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_17
  * Job type: batch job (id=None)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: compile
  * Reason: caught framework exception: Command '['mpif90', '-O2', '-ta=tesla,cc35,cuda8.0', '-I/users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_17', '/users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_17/automatic_arrays.f90', '-o', '/users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_17/./AutomaticArraysCheck']' failed with exit code 127:
=== STDOUT ===
=== STDERR ===
/appsmnt/escha/UES/RH7.3_experimental/pgi/18.4/linux86-64/2018/mpi/openmpi-2.1.2/bin/.bin/mpif90: error while loading shared libraries: libpgm.so: cannot open shared object file: No such file or directory

------------------------------------------------------------------------------
FAILURE INFO for AutomaticArraysCheck
  * System partition: kesch:cn
  * Environment: PrgEnv-pgi_18
  * Stage directory: /users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_18
  * Job type: batch job (id=817530)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: sanity
  * Reason: sanity error: pattern `Result: ' not found in `/users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_18/AutomaticArraysCheck.out'
------------------------------------------------------------------------------
FAILURE INFO for AutomaticArraysCheck
  * System partition: kesch:cn
  * Environment: PrgEnv-pgi_18_aj
  * Stage directory: /users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-pgi_18_aj
  * Job type: batch job (id=817528)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: performance
  * Reason: sanity error: 0.0001628 is beyond reference value 0.00014 (l=-inf, u=0.00016099999999999998)
------------------------------------------------------------------------------
FAILURE INFO for AutomaticArraysCheck
  * System partition: kesch:cn
  * Environment: PrgEnv-cray_aj
  * Stage directory: /users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-cray_aj
  * Job type: batch job (id=817483)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: sanity
  * Reason: sanity error: pattern `Result: ' not found in `/users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-cray_aj/AutomaticArraysCheck.out'
------------------------------------------------------------------------------
FAILURE INFO for AutomaticArraysCheck
  * System partition: kesch:cn
  * Environment: PrgEnv-cray_aj_b
  * Stage directory: /users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-cray_aj_b
  * Job type: batch job (id=817438)
  * Maintainers: ['AJ', 'VK']
  * Failing phase: sanity
  * Reason: sanity error: pattern `Result: ' not found in `/users/karakasv/Devel/reframe/stage/cn/AutomaticArraysCheck/PrgEnv-cray_aj_b/AutomaticArraysCheck.out'
------------------------------------------------------------------------------

You should also not rely on the CI, because it only runs a test if you have changed the test's Python file. In this case you haven't, which is why it does not run it. You should try it manually.

@ajocksch
Contributor Author

ajocksch commented Jun 19, 2018

@lxavier it is necessary to set the variable MV2_USE_CUDA for the Cray compiler and for mvapich compiled for gcc, although no GPU-direct is used; otherwise the code hangs at the first OpenACC directives.
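
In the test this could be expressed roughly as follows (assuming the framework's environment-variable dictionary; the value '1' is an assumption, not taken from the actual fix):

# Exported into the job environment by ReFrame before the run
self.variables = {'MV2_USE_CUDA': '1'}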

@ajocksch
Contributor Author

@lxavier the performance of the check is not 100% reproducible; there might be a problem with the dynamic adaptation of the clock frequencies of the nodes

@vkarak
Contributor

vkarak commented Jun 19, 2018

@jenkins-cscs retry daint

@lxavier
Contributor

lxavier commented Jun 20, 2018

@lxavier it is necessary to set the variable MV2_USE_CUDA for the cray compiler and ..

Interesting. I think when we run cosmo on CPU we set MV2_USE_CUDA=0, but we may use a different mvapich for the CPU. Anyway, all this will have to go into the cosmo module files once Hannes' work is completed. Let's leave it like this for now.

@lxavier
Contributor

lxavier commented Jun 20, 2018

@lxavier the performance of the check is not 100% reproducible

We try to make it long enough so that this should not be an issue. We can increase the threshold; we mainly want to detect if the timing goes completely off. In addition, we wanted to add a graph to http://jenkins-mch.cscs.ch/view/POMPA/job/cosmo5_performance_benchmark/ so that we can monitor the time, so it is OK if it fluctuates a bit.
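
For example, widening the upper bound of the reference tuple (thresholds are relative, as in the failure summary above, where u = 0.00014 * 1.15; the new value is only a suggestion):

self.reference = {
    # (reference value, lower threshold, upper threshold)
    'kesch:cn': {'perf': (0.00014, None, 0.25)},   # e.g. allow +25% instead of +15%
}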

@vkarak vkarak merged commit bdf2090 into master Jun 20, 2018
@vkarak vkarak deleted the checks/mch_automatic_arrays branch June 20, 2018 08:02