
Conversation

@vkarak (Contributor) commented Oct 15, 2022

This PR significantly improves how performance logging is done, as well as how the `filelog` handler behaves. More specifically:

  1. Test performance is no longer logged from within the test in the `check_performance()` method; instead, it is logged through the associated task from the test executor after the test has finished. This also allows us to log the final test result by introducing the new `check_result` log record attribute. We could log other task-related information as well, but this can be added on demand in the future.

  2. Performance logging through the `filelog` handler is completely revised (a configuration sketch is given further below):

    • All performance variables are logged at once in a single line.
    • A new option named `format_perfvars` is introduced for formatting the performance variables. It is used to format the `%(check_perfvalues)s` specifier.
    • The handler now emits a header line describing each logged field. The header is not generic; it is derived from the record itself, so that it contains the names of the performance variables being logged.
    • To ensure the consistency of the produced log files, if the header changes in a subsequent run of the same test (e.g., because the format or the performance variables changed), a new file is generated and the old one is saved.
    • The `filelog` handler in the generic settings was adapted accordingly and now produces pure CSV output, so that it can easily be imported into other tools for visualization.

  3. If either the `format` or the `format_perfvars` option is wrong, we no longer crash; we issue an error message as a log message instead.

  4. The performance-checking phase has been revised and simplified. We no longer maintain two code paths for the legacy `perf_patterns` and the new syntax; we convert the old syntax to the new one and continue from there (a sketch of the two syntaxes is shown right below this list).
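To illustrate the conversion in item 4, here is a rough sketch of the two syntaxes in a hypothetical test; the test body, regular expressions and reference values are made up for illustration, while `perf_patterns`, `reference` and `@performance_function` are the actual ReFrame constructs involved:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class StreamLikeTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = './stream'
    reference = {
        '*': {'copy_bw': (50000, -0.1, None, 'MB/s')}
    }

    # Legacy syntax (still accepted; converted internally to the new one):
    #
    #     def __init__(self):
    #         self.perf_patterns = {
    #             'copy_bw': sn.extractsingle(r'Copy:\s+(\S+)',
    #                                         self.stdout, 1, float)
    #         }

    # New syntax: one decorated performance function per variable
    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @sanity_function
    def assert_finished(self):
        return sn.assert_found(r'Solution Validates', self.stdout)
```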

Users can still opt in to the old behaviour, where a new record was logged for each performance variable and the `%(check_perf_*)s` specifiers were valid in every log record, not only for formatting `%(check_perfvalues)s`. This behaviour can be enabled by setting the `logging/perflog_compat` configuration parameter to true.
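For illustration, here is a minimal sketch of what a logging configuration using the new options might look like; the format strings, prefix and field order are made up and not the exact generic settings, while `format_perfvars` and `perflog_compat` are the parameters introduced by this PR:

```python
site_configuration = {
    # ... systems, environments, etc.
    'logging': [
        {
            # New: set to True to restore the pre-4.0 behaviour of one
            # record per performance variable
            'perflog_compat': False,
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    # One CSV line per test run; %(check_perfvalues)s expands
                    # every performance variable using `format_perfvars`
                    'format': ('%(check_job_completion_time)s,%(version)s,'
                               '%(check_display_name)s,%(check_system)s,'
                               '%(check_partition)s,%(check_environ)s,'
                               '%(check_jobid)s,%(check_result)s,'
                               '%(check_perfvalues)s'),
                    'format_perfvars': ('%(check_perf_value)s,'
                                        '%(check_perf_unit)s,'
                                        '%(check_perf_ref)s,'
                                        '%(check_perf_lower_thres)s,'
                                        '%(check_perf_upper_thres)s,'),
                    'append': True
                }
            ]
        }
    ]
}
```

With a format like this, the header line emitted by the handler names each field, including one group of columns per performance variable, which is what makes the output directly importable as CSV.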

Closes #1543.
Closes #2394.

Todos

  • Update the documentation
  • Update the configuration excerpts in the tutorial examples (if needed)
  • Check the FIXME note in pipeline.py
  • Add unit tests for the filelog handler

@codecov-commenter commented

Codecov Report

Base: 86.26% // Head: 86.15% // Decreases project coverage by -0.11% ⚠️

Coverage data is based on head (911a777) compared to base (db93224).
Patch coverage: 74.69% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2633      +/-   ##
==========================================
- Coverage   86.26%   86.15%   -0.12%     
==========================================
  Files          60       60              
  Lines       10985    11047      +62     
==========================================
+ Hits         9476     9517      +41     
- Misses       1509     1530      +21     
| Impacted Files | Coverage Δ |
|---|---|
| reframe/core/settings.py | 100.00% <ø> (ø) |
| reframe/core/logging.py | 75.58% <71.62%> (-1.43%) ⬇️ |
| reframe/core/pipeline.py | 93.35% <100.00%> (-0.01%) ⬇️ |
| reframe/frontend/executors/__init__.py | 97.67% <100.00%> (+0.03%) ⬆️ |
| reframe/utility/__init__.py | 92.59% <0.00%> (-0.15%) ⬇️ |


@vkarak vkarak marked this pull request as draft October 15, 2022 23:37
Also:

- Make the `filelog` handler respect the `systems/prefix` configuration
  parameter.
Also:

- Do not crash if the log format is wrong, but issue an error message instead.
We no longer use two different code paths for the old and new syntax. We simply
convert the old syntax (`perf_patterns`) to the new one and proceed in a single
code path.
@vkarak vkarak marked this pull request as ready for review October 18, 2022 21:00
@vkarak (Contributor, Author) commented Oct 19, 2022

@jenkins-cscs retry all

@ekouts (Contributor) left a comment

I haven't gone through the PR very carefully, but removing `%(check_perf_lower_thres)s` etc. could be a problem for the data we send with `HTTPJSONHandler`, right? Since Kibana rejects non-homogeneous lists, how could we send the performance data there?
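(For context, the new `check_perfvalues` attribute groups every performance variable into a mapping of mixed-type lists keyed by `system:partition:variable`; the values in the sketch below are taken from the JSON excerpt further down in this thread.)

```python
# Rough shape of the `check_perfvalues` record attribute; each list likely
# holds the measured value, reference, lower/upper thresholds and unit.
check_perfvalues = {
    'daint:gpu:compilation_time': [0.492619976, 60, None, 0.1, 's']
}
```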

@vkarak (Contributor, Author) commented Oct 20, 2022

> I haven't gone through the PR very carefully but removing `%(check_perf_lower_thres)s` etc could be a problem for the data we send with `HTTPJSONHandler`, right?

This PR is indeed breaking and that's why it was scheduled for 4.0. The problem is that we don't log multiple records anymore, one for each variable; we log everything at once. Could you check what the `HTTPJSONHandler` logs now and whether that is in any way parsable by Elastic? If not, we could add an option to that handler to emit multiple records.

@ekouts (Contributor) commented Oct 21, 2022

> This PR is indeed breaking and that's why it was scheduled for 4.0. The problem is that we don't log multiple records anymore, one for each variable, but we log everything at once. Could you check what the HTTPJSONHandler logs now and if that is by any means parsable from Elastic? If not, we could add an option to that handler to emit multiple records.

As far as I can see, there is no way to get the performance data in `httpjson`, since we have to skip `check_perfvalues`.

I tried with your PR and this is the JSON we are sending. We exclude `check_perfvalues` in the configuration to avoid the record being discarded:

```
{
  "check_name": "HelloWorldTestOpenMP %linking=static %lang=c",
  "check_jobid": "42267202",
  "check_job_completion_time": "2022-10-21T10:53:35+02:00",
  "check_job_completion_time_unix": 1666342415.0,
  "check_info": "HelloWorldTestOpenMP %linking=static %lang=c /19d97538 @daint:gpu+PrgEnv-cray",
  "check_system": "daint",
  "check_partition": "gpu",
  "check_environ": "PrgEnv-cray",
  "check_perf_var": null,
  "check_perf_value": null,
  "check_perf_ref": null,
  "check_perf_lower_thres": null,
  "check_perf_upper_thres": null,
  "check_perf_unit": null,
  "check_result": "pass",
  "osuser": "eirinik",
  "osgroup": "csstaff",
  "version": "4.0.0-dev.2",
  "check_job_exitcode": 0,
  "check_job_nodelist": [
    "nid02356"
  ],
  "check_build_locally": true,
  "check_build_time_limit": null,
  "check_descr": "C Hello, World OpenMP Static",
  "check_display_name": "HelloWorldTestOpenMP %linking=static %lang=c",
  "check_exclusive_access": true,
  "check_executable": "./HelloWorldTestOpenMP_3",
  "check_executable_opts": [],
  "check_extra_resources": {},
  "check_hashcode": "19d97538",
  "check_keep_files": [],
  "check_local": false,
  "check_maintainers": [
    "VH",
    "EK"
  ],
  "check_max_pending_time": null,
  "check_modules": [],
  "check_num_cpus_per_task": 4,
  "check_num_gpus_per_node": 0,
  "check_num_tasks": 1,
  "check_num_tasks_per_core": null,
  "check_num_tasks_per_node": 1,
  "check_num_tasks_per_socket": null,
  "check_outputdir": "/users/eirinik/repos/reframe-test/output/daint/gpu/PrgEnv-cray/HelloWorldTestOpenMP_19d97538",
  # "check_perfvalues": {
  #   "daint:gpu:compilation_time": [
  #    0.492619976,
  #     60,
  #     null,
  #     0.1,
  #     "s"
  #   ]
  # },
  "check_postbuild_cmds": [
    "_rfm_build_time=\"$(($(date +%s%N)-_rfm_build_time))\"",
    "echo \"Compilations time (ns): $_rfm_build_time\""
  ],
  "check_postrun_cmds": [],
  "check_prebuild_cmds": [
    "_rfm_build_time=\"$(date +%s%N)\""
  ],
  "check_prefix": "/users/eirinik/repos/cscs-reframe-tests/checks/prgenv",
  "check_prerun_cmds": [],
  "check_readonly_files": [],
  "check_short_name": "HelloWorldTestOpenMP_19d97538",
  "check_sourcepath": "hello_world_openmp.c",
  "check_sourcesdir": "src/openmp",
  "check_stagedir": "/users/eirinik/repos/reframe-test/stage/daint/gpu/PrgEnv-cray/HelloWorldTestOpenMP_19d97538",
  "check_strict_check": true,
  "check_tags": [
    "production",
    "craype"
  ],
  "check_time_limit": null,
  "check_unique_name": "HelloWorldTestOpenMP_3",
  "check_use_multithreading": null,
  "check_valid_prog_environs": [
    "PrgEnv-aocc",
    "PrgEnv-cray",
    "PrgEnv-cray_classic",
    "PrgEnv-gnu",
    "PrgEnv-intel",
    "PrgEnv-pgi",
    "PrgEnv-gnu-nocuda",
    "PrgEnv-pgi-nocuda"
  ],
  "check_valid_systems": [
    "daint:gpu",
    "daint:mc",
    "dom:gpu",
    "dom:mc",
    "arolla:cn",
    "arolla:pn",
    "tsa:cn",
    "tsa:pn"
  ],
  "check_variables": {
    "CRAYPE_LINK_TYPE": "static",
    "OMP_NUM_THREADS": "4"
  },
  "facility": "reframe",
  "data-version": "1.0"
}
```

I think we should add an option to that handler to emit multiple records, or provide a way to send `check_perfvalues` differently.
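(For reference, a rough sketch of how the exclusion above might be expressed in the configuration, assuming the handler's `ignore_keys` and `extras` options; the URL is a placeholder:)

```python
# Goes inside the 'handlers_perflog' list of the 'logging' section
{
    'type': 'httpjson',
    'url': 'http://elastic-ingest.example.org:8080/rfm',  # placeholder URL
    'level': 'info',
    # Skip the nested perfvalues mapping that Kibana would reject
    'ignore_keys': ['check_perfvalues'],
    # Constant extra fields attached to every record
    'extras': {'facility': 'reframe', 'data-version': '1.0'}
}
```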

@vkarak (Contributor, Author) commented Oct 28, 2022

@ekouts I think you are right. We should bring back the option of emitting multiple lines/records per test. I had a discussion over this PR with @flx42 as well, and it seems that abolishing the multi-line filelog would break existing post-processing workflows. The two options serve two different, valid workflows and we should facilitate both after all. The multi-line one enables quick filtering with grep and is more suitable for the httpjson handler, as you mentioned, whereas the single-line approach makes it easier to import all the results as CSV into external software for post-processing. I will work on bringing back the possibility of multi-line logging.

This is disabled by default and can be enabled by setting the `perflog_compat`
logging configuration parameter. This will restore the performance logging
behaviour of pre-4.0 versions.
@vkarak (Contributor, Author) commented Nov 23, 2022

@ekouts Could you retry this PR? To enable the old behaviour, you need to set perflog_compat in your logging configuration.

@vkarak vkarak added this to the ReFrame sprint 22.12.1 milestone Nov 30, 2022
@vkarak vkarak merged commit 707ce8f into reframe-hpc:master Dec 7, 2022
@vkarak vkarak deleted the feat/log-perfvars branch December 7, 2022 21:31

Successfully merging this pull request may close these issues:

  • Consider logging the full test result instead of just the performance
  • Allow multiple perf_patterns on one line in handlers_perflog