
Conversation

@vkarak (Contributor) commented Oct 15, 2022

This PR significantly improves how performance logging is done, as well as how the `filelog` handler behaves. More specifically:

  1. Test performance is no longer logged from within the test in the `check_performance()` method; instead, it is logged through the associated task from the test executor after the test has finished. This also allows us to log the final test result by introducing the new `check_result` log record attribute. We could log other task-related information as well, but this can be added on demand in the future.

  2. Performance logging through the `filelog` handler is completely revised (a configuration sketch is given further below):

    • All performance variables are logged at once in a single line.
    • A new option named `format_perfvars` is introduced for formatting the performance variables. It is used to format the `%(check_perfvalues)s` specifier.
    • The handler now emits a header line describing each logged field. The header is not generic; it is derived from the record itself, so that it contains the names of the performance variables being logged.
    • To ensure the consistency of the produced log files, if the header changes in a subsequent run of the same test (e.g., because the format or the performance variables changed), a new file is generated and the old one is saved.
    • The `filelog` handler in the generic settings was adapted accordingly and now produces pure CSV output, so that it can easily be imported into other tools for visualization.

  3. If either the `format` or the `format_perfvars` option is wrong, we no longer crash; we issue an error message as a log message instead.

  4. The performance-checking phase has been revised and simplified. We no longer maintain two code paths for the legacy `perf_patterns` and the new syntax; we convert the old syntax to the new one and continue from there (a sketch of the two syntaxes is shown right below this list).
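To illustrate the conversion in item 4, here is a rough sketch of the two syntaxes in a hypothetical test; the test body, regular expressions and reference values are made up for illustration, while `perf_patterns`, `reference` and `@performance_function` are the actual ReFrame constructs involved:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class StreamLikeTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = './stream'
    reference = {
        '*': {'copy_bw': (50000, -0.1, None, 'MB/s')}
    }

    # Legacy syntax (still accepted; converted internally to the new one):
    #
    #     def __init__(self):
    #         self.perf_patterns = {
    #             'copy_bw': sn.extractsingle(r'Copy:\s+(\S+)',
    #                                         self.stdout, 1, float)
    #         }

    # New syntax: one decorated performance function per variable
    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @sanity_function
    def assert_finished(self):
        return sn.assert_found(r'Solution Validates', self.stdout)
```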

Users can still opt in to the old behaviour, where a new record was logged for each performance variable and the `%(check_perf_*)s` specifiers were valid in every log record, not only for formatting `%(check_perfvalues)s`. This behaviour can be enabled by setting the `logging/perflog_compat` configuration parameter to true.
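For illustration, here is a minimal sketch of what a logging configuration using the new options might look like; the format strings, prefix and field order are made up and not the exact generic settings, while `format_perfvars` and `perflog_compat` are the parameters introduced by this PR:

```python
site_configuration = {
    # ... systems, environments, etc.
    'logging': [
        {
            # New: set to True to restore the pre-4.0 behaviour of one
            # record per performance variable
            'perflog_compat': False,
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    # One CSV line per test run; %(check_perfvalues)s expands
                    # every performance variable using `format_perfvars`
                    'format': ('%(check_job_completion_time)s,%(version)s,'
                               '%(check_display_name)s,%(check_system)s,'
                               '%(check_partition)s,%(check_environ)s,'
                               '%(check_jobid)s,%(check_result)s,'
                               '%(check_perfvalues)s'),
                    'format_perfvars': ('%(check_perf_value)s,'
                                        '%(check_perf_unit)s,'
                                        '%(check_perf_ref)s,'
                                        '%(check_perf_lower_thres)s,'
                                        '%(check_perf_upper_thres)s,'),
                    'append': True
                }
            ]
        }
    ]
}
```

With a format like this, the header line emitted by the handler names each field, including one group of columns per performance variable, which is what makes the output directly importable as CSV.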

Closes #1543.
Closes #2394.

Todos

  • Update the documentation
  • Update the configuration excerpts in the tutorial examples (if needed)
  • Check the FIXME note in pipeline.py
  • Add unit tests for the filelog handler

@codecov-commenter commented

Codecov Report

Base: 86.26% // Head: 86.15% // Decreases project coverage by -0.11% ⚠️

Coverage data is based on head (911a777) compared to base (db93224).
Patch coverage: 74.69% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2633      +/-   ##
==========================================
- Coverage   86.26%   86.15%   -0.12%     
==========================================
  Files          60       60              
  Lines       10985    11047      +62     
==========================================
+ Hits         9476     9517      +41     
- Misses       1509     1530      +21     
| Impacted Files | Coverage Δ |
|---|---|
| reframe/core/settings.py | 100.00% <ø> (ø) |
| reframe/core/logging.py | 75.58% <71.62%> (-1.43%) ⬇️ |
| reframe/core/pipeline.py | 93.35% <100.00%> (-0.01%) ⬇️ |
| reframe/frontend/executors/__init__.py | 97.67% <100.00%> (+0.03%) ⬆️ |
| reframe/utility/__init__.py | 92.59% <0.00%> (-0.15%) ⬇️ |


@vkarak vkarak marked this pull request as draft October 15, 2022 23:37
Also:

- Make the `filelog` handler respect the `systems/prefix` configuration
  parameter.
Also:

- Do not crash if the log format is wrong, but issue an error message instead.
We no longer use two different code paths for the old and new syntax. We simply
convert the old syntax (`perf_patterns`) to the new one and proceed in a single
code path.
@vkarak vkarak marked this pull request as ready for review October 18, 2022 21:00
@vkarak (Contributor, Author) commented Oct 19, 2022

@jenkins-cscs retry all

@ekouts (Contributor) left a comment

I haven't gone through the PR very carefully, but removing `%(check_perf_lower_thres)s` etc. could be a problem for the data we send with `HTTPJSONHandler`, right? Since Kibana rejects non-homogeneous lists, how could we send the performance data there?
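(For context, the new `check_perfvalues` attribute groups every performance variable into a mapping of mixed-type lists keyed by `system:partition:variable`; the values in the sketch below are taken from the JSON excerpt further down in this thread.)

```python
# Rough shape of the `check_perfvalues` record attribute; each list likely
# holds the measured value, reference, lower/upper thresholds and unit.
check_perfvalues = {
    'daint:gpu:compilation_time': [0.492619976, 60, None, 0.1, 's']
}
```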

@vkarak (Contributor, Author) commented Oct 20, 2022

> I haven't gone through the PR very carefully but removing `%(check_perf_lower_thres)s` etc could be a problem for the data we send with `HTTPJSONHandler`, right?

This PR is indeed breaking and that's why it was scheduled for 4.0. The problem is that we don't log multiple records anymore, one for each variable; we log everything at once. Could you check what the `HTTPJSONHandler` logs now and whether that is in any way parsable by Elastic? If not, we could add an option to that handler to emit multiple records.

@ekouts (Contributor) commented Oct 21, 2022

> This PR is indeed breaking and that's why it was scheduled for 4.0. The problem is that we don't log multiple records anymore, one for each variable, but we log everything at once. Could you check what the HTTPJSONHandler logs now and if that is by any means parsable from Elastic? If not, we could add an option to that handler to emit multiple records.

As far as I can see, there is no way to get the performance data in `httpjson`, since we have to skip `check_perfvalues`.

I tried with your PR and this is the JSON we are sending. We exclude `check_perfvalues` in the configuration to avoid the record being discarded:

```
{
  "check_name": "HelloWorldTestOpenMP %linking=static %lang=c",
  "check_jobid": "42267202",
  "check_job_completion_time": "2022-10-21T10:53:35+02:00",
  "check_job_completion_time_unix": 1666342415.0,
  "check_info": "HelloWorldTestOpenMP %linking=static %lang=c /19d97538 @daint:gpu+PrgEnv-cray",
  "check_system": "daint",
  "check_partition": "gpu",
  "check_environ": "PrgEnv-cray",
  "check_perf_var": null,
  "check_perf_value": null,
  "check_perf_ref": null,
  "check_perf_lower_thres": null,
  "check_perf_upper_thres": null,
  "check_perf_unit": null,
  "check_result": "pass",
  "osuser": "eirinik",
  "osgroup": "csstaff",
  "version": "4.0.0-dev.2",
  "check_job_exitcode": 0,
  "check_job_nodelist": [
    "nid02356"
  ],
  "check_build_locally": true,
  "check_build_time_limit": null,
  "check_descr": "C Hello, World OpenMP Static",
  "check_display_name": "HelloWorldTestOpenMP %linking=static %lang=c",
  "check_exclusive_access": true,
  "check_executable": "./HelloWorldTestOpenMP_3",
  "check_executable_opts": [],
  "check_extra_resources": {},
  "check_hashcode": "19d97538",
  "check_keep_files": [],
  "check_local": false,
  "check_maintainers": [
    "VH",
    "EK"
  ],
  "check_max_pending_time": null,
  "check_modules": [],
  "check_num_cpus_per_task": 4,
  "check_num_gpus_per_node": 0,
  "check_num_tasks": 1,
  "check_num_tasks_per_core": null,
  "check_num_tasks_per_node": 1,
  "check_num_tasks_per_socket": null,
  "check_outputdir": "/users/eirinik/repos/reframe-test/output/daint/gpu/PrgEnv-cray/HelloWorldTestOpenMP_19d97538",
  # "check_perfvalues": {
  #   "daint:gpu:compilation_time": [
  #    0.492619976,
  #     60,
  #     null,
  #     0.1,
  #     "s"
  #   ]
  # },
  "check_postbuild_cmds": [
    "_rfm_build_time=\"$(($(date +%s%N)-_rfm_build_time))\"",
    "echo \"Compilations time (ns): $_rfm_build_time\""
  ],
  "check_postrun_cmds": [],
  "check_prebuild_cmds": [
    "_rfm_build_time=\"$(date +%s%N)\""
  ],
  "check_prefix": "/users/eirinik/repos/cscs-reframe-tests/checks/prgenv",
  "check_prerun_cmds": [],
  "check_readonly_files": [],
  "check_short_name": "HelloWorldTestOpenMP_19d97538",
  "check_sourcepath": "hello_world_openmp.c",
  "check_sourcesdir": "src/openmp",
  "check_stagedir": "/users/eirinik/repos/reframe-test/stage/daint/gpu/PrgEnv-cray/HelloWorldTestOpenMP_19d97538",
  "check_strict_check": true,
  "check_tags": [
    "production",
    "craype"
  ],
  "check_time_limit": null,
  "check_unique_name": "HelloWorldTestOpenMP_3",
  "check_use_multithreading": null,
  "check_valid_prog_environs": [
    "PrgEnv-aocc",
    "PrgEnv-cray",
    "PrgEnv-cray_classic",
    "PrgEnv-gnu",
    "PrgEnv-intel",
    "PrgEnv-pgi",
    "PrgEnv-gnu-nocuda",
    "PrgEnv-pgi-nocuda"
  ],
  "check_valid_systems": [
    "daint:gpu",
    "daint:mc",
    "dom:gpu",
    "dom:mc",
    "arolla:cn",
    "arolla:pn",
    "tsa:cn",
    "tsa:pn"
  ],
  "check_variables": {
    "CRAYPE_LINK_TYPE": "static",
    "OMP_NUM_THREADS": "4"
  },
  "facility": "reframe",
  "data-version": "1.0"
}
```

I think we should add an option to that handler to emit multiple records, or provide a way to send `check_perfvalues` differently.
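(For reference, a rough sketch of how the exclusion above might be expressed in the configuration, assuming the handler's `ignore_keys` and `extras` options; the URL is a placeholder:)

```python
# Goes inside the 'handlers_perflog' list of the 'logging' section
{
    'type': 'httpjson',
    'url': 'http://elastic-ingest.example.org:8080/rfm',  # placeholder URL
    'level': 'info',
    # Skip the nested perfvalues mapping that Kibana would reject
    'ignore_keys': ['check_perfvalues'],
    # Constant extra fields attached to every record
    'extras': {'facility': 'reframe', 'data-version': '1.0'}
}
```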

@vkarak (Contributor, Author) commented Oct 28, 2022

@ekouts I think you are right. We should bring back the option of emitting multiple lines/records per test. I had a discussion over this PR with @flx42 as well, and it seems that abolishing the multi-line filelog would break existing post-processing workflows. The two options serve two different, valid workflows and we should facilitate both after all. The multi-line one enables quick filtering with grep and is more suitable for the httpjson handler, as you mentioned, whereas the single-line approach makes it easier to import all the results as CSV into external software for post-processing. I will work on bringing back the possibility of multi-line logging.

This is disabled by default and can be enabled by setting the `perflog_compat`
logging configuration parameter. This will restore the performance logging
behaviour of pre-4.0 versions.
@vkarak (Contributor, Author) commented Nov 23, 2022

@ekouts Could you retry this PR? To enable the old behaviour, you need to set perflog_compat in your logging configuration.

@vkarak vkarak added this to the ReFrame sprint 22.12.1 milestone Nov 30, 2022
@vkarak vkarak merged commit 707ce8f into reframe-hpc:master Dec 7, 2022
@vkarak vkarak deleted the feat/log-perfvars branch December 7, 2022 21:31

Successfully merging this pull request may close these issues:

  • Consider logging the full test result instead of just the performance
  • Allow multiple perf_patterns on one line in handlers_perflog