LoggingThread should recover from temporary outage on hub #60

Closed
rohanpm opened this issue Oct 5, 2018 · 0 comments · Fixed by #248
Comments


rohanpm commented Oct 5, 2018

When a task is running in a kobo worker, the LoggingThread is responsible for sending log messages back to the hub.

This is done by the following loop:

    def run(self):
        """Send queue content to hub."""
        while self._running or not self._queue.empty() or self._send_data:
            if self._queue.empty():
                self._event.wait(5)

            self._event.clear()
            while True:
                try:
                    self._send_data += self._queue.get_nowait()
                except six.moves.queue.Empty:
                    break

            if not self._send_data:
                continue

            now = int(time.time())
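            # skip sending unless at least 1200 bytes have accumulated or 5 seconds
            # have passed since the last upload; always flush once the thread is stopping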
            if self._running and len(self._send_data) < 1200 and now - self._send_time < 5:
                continue

            if isinstance(self._send_data, six.text_type):
                self._send_data = self._send_data.encode('utf-8')

            try:
                self._hub.upload_task_log(StringIO(self._send_data), self._task_id, "stdout.log", append=True)
                self._send_time = now
                self._send_data = ""
            except Fault:
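                # only XML-RPC faults are swallowed here; any other exception
                # escapes run() and silently kills the logging thread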
                continue

Problem: if self._hub.upload_task_log raises anything other than an XML-RPC Fault, the logging thread simply stops. The main thread of the task being executed, however, doesn't necessarily stop.

The end result is that if the kobo hub has a temporary outage while a task is in progress, the task might continue executing, but all logs produced after the outage would be lost.

It would be better if the LoggingThread kept retrying the task log uploads for as long as the task's main thread is alive, as sketched below.
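
As a rough illustration only (the hub proxy, data, task ID and thread handle below are hypothetical placeholders, not the actual kobo objects), the upload step could be wrapped in a retry loop along these lines, mirroring the upload_task_log call quoted above:

    import logging
    import time
    from io import StringIO

    logger = logging.getLogger(__name__)


    def upload_with_retry(hub, data, task_id, task_thread, delay=5):
        """Keep retrying a task log upload while the task's main thread is alive.

        ``hub``, ``task_id`` and ``task_thread`` are stand-ins used to sketch
        the retry policy suggested in this issue; this is not the actual kobo
        implementation.
        """
        while True:
            try:
                hub.upload_task_log(StringIO(data), task_id, "stdout.log", append=True)
                return True
            except Exception:
                # Treat any failure (XML-RPC fault, network disruption, hub
                # outage) as potentially temporary: record it locally and
                # retry instead of letting the logging thread die.
                logger.exception("upload_task_log failed, retrying in %ss", delay)
                if not task_thread.is_alive():
                    # The task's main thread has finished; stop retrying.
                    return False
                time.sleep(delay)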

Steps to reproduce

  • Start a long-running task
  • Take the kobo hub down temporarily during task execution

Actual behavior

  • The task may continue running, but no further log messages are uploaded

Expected behavior

  • If the task continues running, it resumes uploading logs once the kobo hub is restored.
rohanpm added a commit to rohanpm/kobo that referenced this issue Jan 4, 2019
Previously, LoggingThread would recover from any XML-RPC fault,
but would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from *all* kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60
rohanpm added a commit to rohanpm/kobo that referenced this issue Jan 13, 2019
Previously, LoggingThread would recover from any XML-RPC fault,
but would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from *all* kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60
rohanpm added a commit to rohanpm/kobo that referenced this issue Mar 9, 2022
When a task runs, LoggingThread is responsible for sending all task logs
to the hub via XML-RPC. The handling of errors during this process was:

1) if an XML-RPC fault: retry an unlimited number of times

2) if anything else: the thread silently exits and logs stop forever

This commit aims to improve the behavior in case (2). If the
LoggingThread is about to die (which does happen sometimes in
practice), we should at least try logging the relevant exception to the
worker's local log file.

This relates to issue release-engineering#60, which suggests that case (2) should also
retry. That might still make sense, but I'm reluctant to have this retry
on all kinds of exceptions without first understanding which exceptions
can be hit in practice. So, let's first fix up the logging, then maybe
go back and adjust the retry behavior later based on what we find.
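
A minimal sketch of that logging-before-dying idea (the logger name and upload callable are placeholders, not the real worker code):

    import logging

    # Hypothetical logger; the real worker writes to its own local log file.
    logger = logging.getLogger("kobo.worker")


    def upload_or_log_failure(upload, chunk):
        """Call an upload function and, if it fails, record the traceback in
        the worker's local log before re-raising, so an unexpected failure is
        at least visible locally even though the logging thread still exits."""
        try:
            upload(chunk)
        except Exception:
            logger.exception("LoggingThread: task log upload failed; thread is exiting")
            raise
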
crungehottman added a commit to crungehottman/kobo that referenced this issue Feb 6, 2024
Previously, LoggingThread would recover from any XML-RPC fault, but
would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from all kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60

(This commit is a reimplementation of
release-engineering#106)
lzaoral pushed a commit to lzaoral/kobo that referenced this issue Mar 12, 2024
Previously, LoggingThread would recover from any XML-RPC fault, but
would stop when any other type of exception was encountered.
That is a problem, as it means the worker will permanently give
up sending messages to the hub when all kinds of temporary
issues occur (e.g. a temporary network disruption between worker
and hub). The task underneath may continue running for hours,
with all log messages being discarded.

Given the nature of this thread, it makes more sense to attempt
recovering from all kinds of errors, as we should try hard not
to lose log messages from a task.

Fixes release-engineering#60

(This commit is a reimplementation of
release-engineering#106)