LoggingThread should recover from temporary outage on hub #60

Closed
rohanpm opened this issue Oct 5, 2018 · 0 comments · Fixed by #248
Comments


rohanpm commented Oct 5, 2018

When a task is running in a kobo worker, the LoggingThread is responsible for sending log messages back to the hub.

This is done by the following loop:

    def run(self):
        """Send queue content to hub."""
        while self._running or not self._queue.empty() or self._send_data:
            if self._queue.empty():
                self._event.wait(5)

            self._event.clear()
            while True:
                try:
                    self._send_data += self._queue.get_nowait()
                except six.moves.queue.Empty:
                    break

            if not self._send_data:
                continue

            now = int(time.time())
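            # skip sending unless at least 1200 bytes have accumulated or 5 seconds
            # have passed since the last upload; always flush once the thread is stopping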
            if self._running and len(self._send_data) < 1200 and now - self._send_time < 5:
                continue

            if isinstance(self._send_data, six.text_type):
                self._send_data = self._send_data.encode('utf-8')

            try:
                self._hub.upload_task_log(StringIO(self._send_data), self._task_id, "stdout.log", append=True)
                self._send_time = now
                self._send_data = ""
            except Fault:
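                # only XML-RPC faults are swallowed here; any other exception
                # escapes run() and silently kills the logging thread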
                continue

Problem: if self._hub.upload_task_log raises anything other than an XML-RPC Fault, the logging thread simply stops. The main thread of the task being executed, however, doesn't necessarily stop.

The end result is that if the kobo hub has a temporary outage while a task is in progress, the task might continue executing, but all logs produced after the outage would be lost.

It would be better if the LoggingThread kept retrying the task log uploads for as long as the task's main thread is alive, as sketched below.
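
As a rough illustration only (the hub proxy, data, task ID and thread handle below are hypothetical placeholders, not the actual kobo objects), the upload step could be wrapped in a retry loop along these lines, mirroring the upload_task_log call quoted above:

    import logging
    import time
    from io import StringIO

    logger = logging.getLogger(__name__)


    def upload_with_retry(hub, data, task_id, task_thread, delay=5):
        """Keep retrying a task log upload while the task's main thread is alive.

        ``hub``, ``task_id`` and ``task_thread`` are stand-ins used to sketch
        the retry policy suggested in this issue; this is not the actual kobo
        implementation.
        """
        while True:
            try:
                hub.upload_task_log(StringIO(data), task_id, "stdout.log", append=True)
                return True
            except Exception:
                # Treat any failure (XML-RPC fault, network disruption, hub
                # outage) as potentially temporary: record it locally and
                # retry instead of letting the logging thread die.
                logger.exception("upload_task_log failed, retrying in %ss", delay)
                if not task_thread.is_alive():
                    # The task's main thread has finished; stop retrying.
                    return False
                time.sleep(delay)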

Steps to reproduce

  • Start a long-running task
  • Take the kobo hub down temporarily during task execution

Actual behavior

  • The task may continue running, but no further log messages are uploaded

Expected behavior

  • If the task continues running, it resumes uploading logs once the kobo hub is restored.
rohanpm added a commit to rohanpm/kobo that referenced this issue Jan 4, 2019
Previously, LoggingThread would recover from any XML-RPC fault,
but would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from *all* kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60
rohanpm added a commit to rohanpm/kobo that referenced this issue Jan 13, 2019
Previously, LoggingThread would recover from any XML-RPC fault,
but would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from *all* kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60
rohanpm added a commit to rohanpm/kobo that referenced this issue Mar 9, 2022
When a task runs, LoggingThread is responsible for sending all task logs
to the hub via XML-RPC. The handling of errors during this process was:

1) if an XML-RPC fault: retry an unlimited number of times

2) if anything else: the thread silently exits and logs stop forever

This commit aims to improve the behavior in case (2). If the
LoggingThread is about to die (which does happen sometimes in
practice), we should at least try logging the relevant exception to the
worker's local log file.

This relates to issue release-engineering#60, which suggests that case (2) should also
retry. That might still make sense, but I'm reluctant to have this retry
on all kinds of exceptions without first understanding which exceptions
can be hit in practice. So, let's first fix up the logging, then maybe
go back and adjust the retry behavior later based on what we find.
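
A minimal sketch of that logging-before-dying idea (the logger name and upload callable are placeholders, not the real worker code):

    import logging

    # Hypothetical logger; the real worker writes to its own local log file.
    logger = logging.getLogger("kobo.worker")


    def upload_or_log_failure(upload, chunk):
        """Call an upload function and, if it fails, record the traceback in
        the worker's local log before re-raising, so an unexpected failure is
        at least visible locally even though the logging thread still exits."""
        try:
            upload(chunk)
        except Exception:
            logger.exception("LoggingThread: task log upload failed; thread is exiting")
            raise
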
crungehottman added a commit to crungehottman/kobo that referenced this issue Feb 6, 2024
Previously, LoggingThread would recover from any XML-RPC fault, but
would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from all kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60

(This commit is a reimplementation of
release-engineering#106)
lzaoral pushed a commit to lzaoral/kobo that referenced this issue Mar 12, 2024
Previously, LoggingThread would recover from any XML-RPC fault, but
would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from all kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60

(This commit is a reimplementation of
release-engineering#106)