Contributor

@lucamar lucamar commented Feb 6, 2020

This pull request defines, for the new systems Arolla and Tsa, the checks that are currently run on the MCH system Kesch. Only a few checks still need to be adapted (Fieldextra, LETKF, ...).

Expected failures in this pull request that are still under investigation (checked if fixed):

lucamar and others added 30 commits August 19, 2019 17:13
Contributor

@vkarak vkarak left a comment

lgtm now. Thanks @lucamar for the effort!

Contributor

vkarak commented Feb 14, 2020

@lucamar The PR seems to be breaking CDO tests on Kesch: https://jenkins.cscs.ch/blue/organizations/jenkins/ReframeCI/detail/ReframeCI/2966/pipeline/20, whereas they don't fail on master.

Contributor Author

lucamar commented Feb 14, 2020

@vkarak Thanks for the information, my last commit should have fixed the issue with CDO on Kesch.

Contributor

vkarak commented Feb 14, 2020

Thanks @lucamar. If the CI passes except for the failures mentioned in the description, I will merge this PR.

Contributor

vkarak commented Feb 14, 2020

@lucamar Some more tests are broken on Kesch. Tsa/Arolla seem fine. Can you please check?

Contributor Author

lucamar commented Feb 14, 2020

I was not able to reproduce the failure of AllocSpeedTest_no on Kesch, so I will only test Tsa for the moment. Furthermore, the Kesch post-processing nodes are currently heavily used.

Contributor Author

lucamar commented Feb 14, 2020

@jenkins-cscs retry tsa

Contributor Author

lucamar commented Feb 14, 2020

The other failure on Kesch is KernelLatencyTest_sync. Even after I fixed it, the check still fails due to low performance:

  * Reason: performance error: failed to meet reference: latency=13.846, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=14.1951, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=13.7136, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=13.2608, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=13.8325, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=13.2889, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=13.9814, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------
  * Reason: performance error: failed to meet reference: latency=13.6442, expected 12.0 (l=-inf, u=13.200000000000001)
------------------------------------------------------------------------------

Therefore, I have increased the expected value to the average of the 8 results above (approximately 13.7).
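
For reference, the arithmetic behind this change can be sketched in plain Python. This is only an illustration, not the check's actual code: the helper `upper_bound` and the 10% relative upper threshold are assumptions, the latter inferred from `u=13.2...` against `expected 12.0` in the log above.

```python
from statistics import mean

# Latency values copied from the failure log above.
latencies = [13.846, 14.1951, 13.7136, 13.2608,
             13.8325, 13.2889, 13.9814, 13.6442]


def upper_bound(expected, upper_thres):
    """Largest accepted value for a reference with a relative upper threshold.

    Hypothetical helper for illustration only.
    """
    return expected * (1 + upper_thres)


old_expected = 12.0
upper_thres = 0.1  # assumption: inferred from u=13.2 for expected=12.0

# New expected value: the mean of the 8 measurements, rounded to one decimal.
new_expected = round(mean(latencies), 1)

old_limit = upper_bound(old_expected, upper_thres)
new_limit = upper_bound(new_expected, upper_thres)

# Measurements that exceed the accepted upper limit before and after the change.
failing_before = [x for x in latencies if x > old_limit]
failing_after = [x for x in latencies if x > new_limit]

print(f"new expected: {new_expected}, new upper limit: {new_limit:.2f}")
print(f"failing before: {len(failing_before)}, after: {len(failing_after)}")
```

Under these assumptions, all 8 measurements exceed the old limit, and none exceed the limit derived from the new expected value.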

Contributor Author

lucamar commented Feb 14, 2020

I can confirm that on Tsa the only checks still failing are the ones expected to fail (see list above).
I have pushed the fix for KernelLatencyTest_sync on Kesch and I will monitor the results.

Contributor Author

lucamar commented Feb 14, 2020

@jenkins-cscs retry kesch

Contributor

@vkarak vkarak left a comment

Only a couple of minor comments. I could fix them.

Contributor

vkarak commented Feb 14, 2020

For DGEMM and GPU burn on Kesch we should open internal issues.

Contributor

vkarak commented Feb 14, 2020

@jenkins-cscs retry kesch tsa

Contributor

vkarak commented Feb 15, 2020

Thanks @lucamar for this PR. I will now merge it. Check UES-717 and UES-718 for the failures on Kesch.

@vkarak vkarak merged commit e936465 into reframe-hpc:master Feb 15, 2020
@lucamar lucamar deleted the tsa-19.04 branch February 18, 2020 11:17
6 participants