
Conversation

@victorusu (Contributor) commented Mar 21, 2018

EDIT by @vkarak

This PR, along with the support for Graylog performance logging, brings a redesign of the performance logging:

  • New syntax for logging handler configuration.
    The new syntax is more like JSON.
    A warning is issued if the old syntax is used, and it is
    automatically converted to the new one.

  • Logging fields for checks are enriched with performance information

  • A special logging handler is introduced that dynamically creates a
    file based on live information of the first record to be logged.
    This handler is used for file performance logging, where each
    performance test must have its own performance file.

  • No more properties in settings, because they require some programming
    skills to edit and make it harder to adapt to future settings changes.
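
The new JSON-like configuration can be sketched as follows. The `'type': 'file'` / `'type': 'graylog'` keys mirror the examples quoted later in this discussion; the remaining option names and values are assumptions for illustration only:

```python
# Hedged sketch of the new JSON-like handler configuration described
# above. Only the 'type' values are taken from this discussion; the
# other keys are illustrative assumptions, not the actual schema.
perf_logging_config = {
    'level': 'INFO',
    'handlers': [
        {
            'type': 'file',
            'name': 'reframe_perf.log',   # hypothetical option name
            'level': 'DEBUG',
        },
        {
            'type': 'graylog',
            'host': 'my-graylog-server.example.org',  # placeholder host
            'port': 12201,
        },
    ],
}
```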

Still todo:

  • Update documentation

The following description is now obsolete

This PR adds support for logging the performance of checks using Graylog.

The implementation adds the following fields to Graylog:

  • check_info, which contains the check name, the system:partition and the PrgEnv
  • check_name, which shows the check name
  • check_partition, which displays the system partition used
  • check_perf_reference, which shows the reference value of a given test
  • check_perf_upper_thres, which displays the upper threshold of a test
  • check_perf_lower_thres, which displays the lower threshold of a check
  • check_perf_value, which holds the actual performance value recorded on that system
  • check_system, which shows the system on which the test has run
  • data-version, which displays the ReFrame version of the data
  • version, which displays the ReFrame version

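
Taken together, a single logged record carrying the fields above might look like this (every value below is made up for illustration):

```python
# Illustrative record with the Graylog fields listed above;
# all values are invented for the example.
record = {
    'check_info': 'mycheck on daint:gpu using PrgEnv-gnu',
    'check_name': 'mycheck',
    'check_partition': 'gpu',
    'check_perf_reference': 0.93,
    'check_perf_upper_thres': 0.05,
    'check_perf_lower_thres': None,
    'check_perf_value': 0.916756,
    'check_system': 'daint',
    'data-version': '1.0',
    'version': '2.13-dev0',
}
```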
@victorusu victorusu requested a review from vkarak March 21, 2018 13:02
@victorusu victorusu self-assigned this Mar 21, 2018
@victorusu victorusu added this to the ReFrame sprint 2018w11 milestone Mar 21, 2018
@vkarak vkarak modified the milestones: ReFrame sprint 2018w11, Upcoming sprint Mar 26, 2018
@vkarak (Contributor) left a comment


Now that performance logging is configurable, it makes sense to add also unit tests to test what gets printed in the log file.

@@ -1 +1 @@
../config/generic.py No newline at end of file
../config/cscs.py No newline at end of file

Please revert this.

config/cscs.py Outdated
_perf_logging_config = {
'level': 'INFO',
'handlers': {
'__h_graylog': {

I think it would be better to have the handlers as a list, where each handler entry has a type attribute specifying its purpose. The handlers entry would then look like this:

'handlers': [
    {
        'type': 'graylog',
        ...
    },
    {
        'type': 'file',
        ...
    }
]

What do you think?
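
For what it's worth, the auto-conversion from the old dict-keyed syntax mentioned in the PR description could be sketched along these lines (the function name and the name-based type guess are assumptions, not the actual implementation):

```python
def convert_handlers(old_handlers):
    """Convert dict-keyed handler config to the proposed list syntax."""
    new_handlers = []
    for name, conf in old_handlers.items():
        entry = dict(conf)
        # Guess the handler type from the old key name (an assumption)
        entry['type'] = 'graylog' if 'graylog' in name else 'file'
        new_handlers.append(entry)
    return new_handlers
```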


@victorusu The problem with this syntax is that we would also have to change the standard logging syntax...


import reframe
import reframe.core.debug as debug
from reframe.settings import settings

This is now obsolete. You should get the settings from the config.py using the corresponding function.

def getlogger():
    return _context_logger


def getperflogger(check):

We could omit the check argument completely and get the actual check from the current logger returned by getlogger(). This logger will be associated with the currently executing check; if there is no check associated, this method should throw.
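
A minimal sketch of that proposal (the exception type, the module-level variable and the perf_logger attribute are all assumptions, not the actual ReFrame code):

```python
class LoggingError(Exception):
    """Raised when no check is associated with the logging context."""


_current_check = None   # the framework would set this while a check runs


def getperflogger():
    # Derive the check from the current logging context instead of
    # taking it as an argument, as proposed above
    if _current_check is None:
        raise LoggingError('no check is associated with the '
                           'current logging context')
    return _current_check.perf_logger   # hypothetical attribute
```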

perf_extra['check_perf_lower_thres'] = low_thres

if high_thres is not None:
    perf_extra['check_perf_upper_thres'] = high_thres

As far as I remember, the Graylog backend needs some arguments passed in the extra argument. Could you please remind me what exactly is needed? It seems we need to fine-tune this a bit here.

@vkarak (Contributor) commented Apr 3, 2018

@victorusu Please update also this branch with the latest master, cos it's quite behind.

@vkarak vkarak modified the milestones: ReFrame sprint 2018w13, Upcoming sprint Apr 15, 2018
@vkarak vkarak modified the milestones: ReFrame sprint 2018w16, Upcoming sprint May 3, 2018
@vkarak vkarak self-assigned this May 3, 2018
Vasileios Karakasis added 2 commits May 24, 2018 09:57
This commit brings several changes:

- New syntax for logging handler configuration.
  The new syntax is more like JSON.
  A warning is issued if the old syntax is used and it is automatically
  converted to the new.

- Logging fields for checks are enriched with performance information

- A special logging handler is introduced that dynamically creates a
  file based on live information of the first record to be logged.
  This handler is used for file performance logging, where each
  performance test must have its own performance file.

- No more properties in settings, because they require some programming
  skills to edit and make it harder to adapt to future settings changes.
@vkarak vkarak requested a review from teojgo May 25, 2018 15:39
@vkarak vkarak changed the title Adding support graylog performance logging [WIP] Redesign of performance logging and support for Graylog May 25, 2018
job_submit_timeout = 60
checks_path = ['checks/']
checks_path_recurse = True
site_configuration = {

I made all these attributes public and removed properties. See the PR description for a rationale. Opinions?

'%(check_perf_var)s=%(check_perf_value)s|'
'ref=%(check_perf_ref)s '
'(l=%(check_perf_lower_thres)s, '
'u=%(check_perf_upper_thres)s)'

This is the produced performance log. Let me know if you like it:

2018-05-25T17:17:40|reframe 2.13-dev0|magma_zsymmetrize_prod on dom:gpu using PrgEnv-gnu|jobid=746609|cpu_perf=0.916756|ref=0.93 (l=None, u=0.05)
2018-05-25T17:17:40|reframe 2.13-dev0|magma_zsymmetrize_prod on dom:gpu using PrgEnv-gnu|jobid=746609|gpu_perf=158.9385|ref=158.4 (l=None, u=0.05)
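
Putting it together, the sample line above can be reproduced with plain %-style formatting over the record fields. The leading asctime/version/jobid fields are inferred from the sample output, not quoted from the diff:

```python
# Reconstruct the performance log line shown above with %-style
# formatting; the record values are copied from the first sample line.
fmt = ('%(asctime)s|reframe %(version)s|%(check_info)s|'
       'jobid=%(check_jobid)s|'
       '%(check_perf_var)s=%(check_perf_value)s|'
       'ref=%(check_perf_ref)s '
       '(l=%(check_perf_lower_thres)s, u=%(check_perf_upper_thres)s)')
record = {
    'asctime': '2018-05-25T17:17:40',
    'version': '2.13-dev0',
    'check_info': 'magma_zsymmetrize_prod on dom:gpu using PrgEnv-gnu',
    'check_jobid': 746609,
    'check_perf_var': 'cpu_perf',
    'check_perf_value': 0.916756,
    'check_perf_ref': 0.93,
    'check_perf_lower_thres': None,
    'check_perf_upper_thres': 0.05,
}
print(fmt % record)
```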

import reframe
import reframe.core.debug as debug
from reframe.core.exceptions import ConfigError, LoggingError
from reframe.settings import settings

I need to remove this.

'check_perf_ref': None,
'check_perf_lower_thres': None,
'check_perf_upper_thres': None,
'data-version': reframe.VERSION,

I am not using data-version, but perhaps I could. @victorusu What is the intention for that?

hdlr = _create_graylog_handler(handler_config)
if hdlr is None:
    sys.stderr.write('WARNING: could not initialize the '
                     'graylog handler; ignoring...\n')

I am getting this warning, which means that the bare Python packages do not contain it. We should add it there.

def process(self, msg, kwargs):
    # Setup dynamic fields of the check

def _update_check_extras(self):
    """Return a dictionary with all the check-specific information."""

This comment is obsolete.

# Performance logging
self._perf_logger = logging.null_logger
self._perf_logfile = None
self._perf_logdir = None

This is not used.

try:
    logging.configure_logging(settings.logging_config)
    logging.configure_logging(settings.logging_config,
                              settings.perf_logging_config)

This will throw an AttributeError if somebody tries to use this version with an old configuration file. I could perhaps catch it and issue a warning message. It would not be a problem if I were passing it as None when the attribute didn't exist, but this would silently make the framework not produce performance logs.
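
One way to handle this gracefully (an assumption for illustration, not necessarily the merged fix) is to look the attribute up with a default and warn loudly instead of failing:

```python
import sys


class OldStyleSettings:
    """Stand-in for a settings module predating perf_logging_config."""
    logging_config = {'level': 'INFO'}


settings = OldStyleSettings()
# getattr with a default avoids the AttributeError on old config files
perf_config = getattr(settings, 'perf_logging_config', None)
if perf_config is None:
    sys.stderr.write('WARNING: perf_logging_config not found in the '
                     'configuration; performance logs will not be '
                     'produced\n')
```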

captured_stdout = StringIO()
captured_stderr = StringIO()
print(sys.argv)
print(' '.join(sys.argv))

I should remove this.

@vkarak (Contributor) commented May 25, 2018

@victorusu Can you review and reply to my comments?

PS: Technically, you cannot review, because you submitted this PR, but just leave your comments.

Performance logging logic (log files, log directory creation) is now
completely part of the performance logging configuration. More
specifically, the performance log directory and files are only relevant
to the `filelog` backend logging handler. For this reason, the logic of
creating the logging prefix is now removed from the runtime resources as
well.

self.baseFilename = os.path.join(dirname, record.check_name + '.log')
self.stream = self._streams.get(self.baseFilename, None)
self._streams[self.baseFilename] = self.stream

This should go after the super().emit(...)
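
A self-contained sketch of the dynamic per-check file handler under discussion, with the stream cached only after super().emit(...) as suggested (the class name and constructor are assumptions, not the actual ReFrame code):

```python
import logging
import os


class MultiFileHandler(logging.FileHandler):
    """Write each check's records to its own log file (sketch)."""

    def __init__(self, dirname):
        self._dirname = dirname
        self._streams = {}
        # delay=True: no file is opened until the first record arrives
        super().__init__(os.path.join(dirname, 'default.log'), delay=True)

    def emit(self, record):
        # Choose the target file from live information in the record
        self.baseFilename = os.path.join(self._dirname,
                                         record.check_name + '.log')
        self.stream = self._streams.get(self.baseFilename)
        super().emit(record)          # opens a new stream if needed
        # Cache the stream only after emit(), per the review comment
        self._streams[self.baseFilename] = self.stream
```

Records must carry a `check_name` attribute, e.g. via `logger.info('...', extra={'check_name': 'mycheck'})`.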

- Also some fine-tuning of the Graylog backend impl.
@vkarak vkarak changed the title [WIP] Redesign of performance logging and support for Graylog Redesign of performance logging and support for Graylog Jun 4, 2018
@codecov-io commented Jun 7, 2018

Codecov Report

Merging #213 into master will decrease coverage by 0.17%.
The diff coverage is 82.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #213      +/-   ##
==========================================
- Coverage    91.3%   91.13%   -0.18%     
==========================================
  Files          68       68              
  Lines        8107     8231     +124     
==========================================
+ Hits         7402     7501      +99     
- Misses        705      730      +25
Impacted Files Coverage Δ
unittests/test_fields.py 100% <ø> (ø) ⬆️
reframe/core/runtime.py 87.39% <ø> (-0.01%) ⬇️
unittests/test_logging.py 98.48% <100%> (+0.19%) ⬆️
reframe/settings.py 100% <100%> (+32%) ⬆️
reframe/core/exceptions.py 82.17% <100%> (+0.13%) ⬆️
unittests/resources/settings.py 100% <100%> (+12%) ⬆️
reframe/core/pipeline.py 94.7% <100%> (-0.02%) ⬇️
reframe/frontend/cli.py 78.57% <57.89%> (-1.85%) ⬇️
reframe/frontend/printer.py 86.66% <66.66%> (-2.07%) ⬇️
reframe/core/logging.py 84.19% <79.45%> (-5.36%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bdf2090...71f4e30. Read the comment docs.

config/cscs.py Outdated
'handlers': [
{
'type': 'graylog',
'host': 'your-sever-here',

Change to your-server-here

docs/running.rst Outdated
By default this field has the form ``<check_name> on <current_partition> using <current_environment>``.
It can be configured on a per test basis by overriding the :func:`info <reframe.core.pipeline.RegressionTest.info>` method in your regression test.
It can be configured on a per test basis by overriding the :func:`info <reframe.core.pipeline.RegressionTest.info>` method of a specific regression test.
- ``check_jobid``: Prints the job or process id of the job or process associated with currently executing regression test.

Change to ... associated with the currently executing regression test

docs/running.rst Outdated
- ``check_outputdir``: The output directory associated with the currently executing test.
- ``check_partition``: The system partition a test is currently executing.
- ``check_stagedir``: The stage directory associated with the currently executing test.
- ``check_system``: The host system a test is currently executing.

Change to The host system on which a test is currently executing.

@vkarak (Contributor) Jun 14, 2018

I will change it to "The host system where this test is currently executing."

docs/running.rst Outdated
"""""""""""""""""""""""""""""""""

The type of this handler is ``graylog`` and it logs performance data to a `Graylog <https://www.graylog.org/>`__ server.
Graylog is distributed enterprise log management service.

Change to Graylog is a distributed ...

docs/running.rst Outdated
* ``extras``: (optional) A set of optional user attributes to be passed with each log record to the server.
These may depend on the server configuration.

This log handler uses internally `pygelf <https://pypi.org/project/pygelf/>`__, so this module Python must be available, otherwise this log handler will be ignored.

Change to ... so this Python module must be available

@teojgo (Contributor) commented Jun 13, 2018

Except for those minor documentation changes, it looks fine to me.

@vkarak (Contributor) commented Jun 14, 2018

@victorusu Did you have time to check if the Graylog backend actually works?

@victorusu (Contributor, Author)

@vkarak, I am getting the following error:

FAILURE INFO for gromacs_gpu_prod_check 
  * System partition: daint:gpu
  * Environment: PrgEnv-gnu
  * Stage directory: /users/hvictor/Work/grafana-test/reframe/stage/gpu/gromacs_gpu_prod_check/PrgEnv-gnu
  * Job type: batch job (id=8060458)
  * Maintainers: ['VH']
  * Failing phase: performance
  * Reason: unexpected error: Object of type 'set' is not JSON serializable
Traceback (most recent call last):
  File "/users/hvictor/Work/grafana-test/reframe/reframe/frontend/executors/__init__.py", line 58, in _safe_call
    return fn(*args, **kwargs)
  File "/users/hvictor/Work/grafana-test/reframe/reframe/core/pipeline.py", line 1003, in performance
    self.check_performance()
  File "/users/hvictor/Work/grafana-test/reframe/reframe/core/pipeline.py", line 1042, in check_performance
    ref, low_thres, high_thres)
  File "/users/hvictor/Work/grafana-test/reframe/reframe/core/logging.py", line 390, in log_performance
    self.log(level, msg)
  File "/users/hvictor/Work/grafana-test/reframe/reframe/core/logging.py", line 405, in log
    super().log(level, msg, *args, **kwargs)
  File "/opt/python/3.6.1.1/lib/python3.6/logging/__init__.py", line 1672, in log
    self.logger._log(level, msg, args, **kwargs)
  File "/opt/python/3.6.1.1/lib/python3.6/logging/__init__.py", line 1442, in _log
    self.handle(record)
  File "/opt/python/3.6.1.1/lib/python3.6/logging/__init__.py", line 1452, in handle
    self.callHandlers(record)
  File "/opt/python/3.6.1.1/lib/python3.6/logging/__init__.py", line 1514, in callHandlers
    hdlr.handle(record)
  File "/opt/python/3.6.1.1/lib/python3.6/logging/__init__.py", line 863, in handle
    self.emit(record)
  File "/apps/daint/UES/jenkins/6.0.UP04/gpu/easybuild/software/PyExtensions/3.6-CrayGNU-17.12/lib/python3.6/site-packages/pygelf-0.3.1-py3.6.egg/pygelf/handlers.py", line 156, in emit
    data = self.convert_record_to_gelf(record)
  File "/apps/daint/UES/jenkins/6.0.UP04/gpu/easybuild/software/PyExtensions/3.6-CrayGNU-17.12/lib/python3.6/site-packages/pygelf-0.3.1-py3.6.egg/pygelf/handlers.py", line 35, in convert_record_to_gelf
    self.compress
  File "/apps/daint/UES/jenkins/6.0.UP04/gpu/easybuild/software/PyExtensions/3.6-CrayGNU-17.12/lib/python3.6/site-packages/pygelf-0.3.1-py3.6.egg/pygelf/gelf.py", line 66, in pack
    packed = json.dumps(gelf).encode('utf-8')
  File "/opt/python/3.6.1.1/lib/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/opt/python/3.6.1.1/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/python/3.6.1.1/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/opt/python/3.6.1.1/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'set' is not JSON serializable

@vkarak (Contributor) commented Jun 18, 2018

@victorusu The problem seems to come from here. Can you try converting tags to a list in this point and retry?

@vkarak (Contributor) commented Jun 18, 2018

@victorusu Or just call str() on it.
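
The failure in the traceback above is easy to reproduce, and either suggestion fixes it (a minimal repro with a made-up tag, not ReFrame code):

```python
import json

tags = {'production'}    # a set, as in the failing check

try:
    json.dumps({'check_tags': tags})
except TypeError:
    pass                 # 'Object of type ... is not JSON serializable'

# Either suggestion makes the record serializable again:
as_list = json.dumps({'check_tags': list(tags)})
as_str = json.dumps({'check_tags': str(tags)})
```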

- A log message is required for the server to accept the log, so we
  generate a default `sent by $USER` message, if not specified.

Also:
- Fix how check tags are logged.
- Fix output of tags in check listing.
@vkarak (Contributor) commented Jun 19, 2018

@teojgo Can you go over this PR again and approve it? It's now final and ready to be merged. The Graylog backend also works fine.

@vkarak (Contributor) commented Jun 20, 2018

@jenkins-cscs retry dom

Also:
- Adapted PBS config file
- PrettyPrinter prints colored messages for warnings and errors.
- Simplify message format for Graylog handler.
@vkarak vkarak merged commit 4185d5f into master Jun 20, 2018
@vkarak vkarak deleted the feature/graylog-perf-support branch June 20, 2018 11:28
