STAR-57. Check if pid is alive while waiting for logs #724

tlasica · 2021-01-14T14:36:04Z

Rationale:
When starting C*, ccm watches the logs, waiting for particular phrases.
We want to fail the test ASAP if there is a problem with node start.

Before:
If there is any problem with node start, the expected log phrase never appears and start() hits the timeout, by default 90s.
So the test fails after 90s even though the node actually failed after 5s.

After:
While logs are being watched, ccm checks whether the node pid is still alive.
If not, it immediately raises NodeError to fail the test, reporting that the node did not start correctly.

Note:
This may break some tests that exercise node start failure by expecting TimeoutError.
Such tests should be adjusted, but there should be only a few.

8ba2709 is about some refactoring
8a960d9 is the change
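
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the ccm implementation: the local `NodeError` class, `pid_is_alive`, and this simplified `watch_log_for` signature are assumptions made for the example, and the liveness check via `os.kill(pid, 0)` is POSIX-only.

```python
import os
import re
import time


class NodeError(Exception):
    """Stand-in for ccm's NodeError (hypothetical local definition)."""


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks the pid
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def watch_log_for(log_path, pattern, pid, timeout=90.0, poll_interval=0.1):
    """Return the first log line matching pattern, failing fast if pid dies."""
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    with open(log_path) as log:
        while True:
            line = log.readline()
            if line:
                if regex.search(line):
                    return line
                continue
            # No new output: check liveness before sleeping.
            if not pid_is_alive(pid):
                # Drain lines written just before exit to avoid a false alarm.
                for late_line in log.read().splitlines():
                    if regex.search(late_line):
                        return late_line
                raise NodeError("process %d exited before logging %r" % (pid, pattern))
            if time.monotonic() > deadline:
                raise TimeoutError("%r not found within %ss" % (pattern, timeout))
            time.sleep(poll_interval)
```

Draining the log once after the process is found dead is one way to narrow the race between reading the log and the moment the pid disappears.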

tlasica · 2021-01-14T15:08:05Z

Kind of interesting that build fails on python 3.7.7

tlasica · 2021-01-14T15:11:45Z

ccmlib/node.py

+                    pid_before_check = self.pid
+                    self.__update_status()
+                    if not self.is_running():
+                        raise RuntimeError("node {n} with pid {pid} is not running".format(n=self.name, pid=pid_before_check))


I wonder if it can cause problems in node.stop()...
It should not, but there is a possible race between reading the log and the moment the pid disappears.

tlasica · 2021-01-14T15:19:55Z

retest this please

tlasica · 2021-01-14T20:12:35Z

Hi, may I ask why you closed this PR, @michaelsembwever?

michaelsembwever · 2021-01-14T21:09:20Z

That was unintentional.

It looks like when I deleted the cassandra-test branch all PRs pointing to it got automatically closed.

ref: https://issues.apache.org/jira/browse/CASSANDRA-16383

tlasica · 2021-01-15T10:28:15Z

@michaelsembwever I see, so I can just rebase branch to master ...

michaelsembwever · 2021-01-15T10:58:01Z

I can just rebase branch to master ...

Yes :)

tlasica · 2021-01-15T16:29:57Z

ccmlib/cluster.py

@@ -518,8 +516,7 @@ def start(self, no_wait=False, verbose=False, wait_for_binary_proto=True,

                # if the node is going to allocate_strategy_ tokens during start, then wait_for_binary_proto=True
                node_wait_for_binary_proto = (self.can_generate_tokens() and self.use_vnodes and node.initial_token is None)
-
-                p = node.start(update_pid=False, jvm_args=jvm_args, jvm_version=jvm_version,
+                p = node.start(update_pid=True, jvm_args=jvm_args, jvm_version=jvm_version,


this is a tricky change... I am not 100% sure it is correct
I wonder why we could start with update_pid=False?

tlasica · 2021-01-15T16:31:13Z

ccmlib/node.py

@@ -558,21 +558,35 @@ def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbos
                            matchings.append((line, m))
                            tofind.remove(e)
                            if len(tofind) == 0:
+                                common.debug("return")


TO BE REMOVED

tlasica · 2021-01-15T16:35:10Z

ccmlib/node.py

+                        common.warning(msg)
+                        raise NodeError(msg)
+
+                # process is in this case process that started C*


This is another tricky part....

Possibly the intention of this part of the code was to finish immediately if the child process terminated...
Returning None is not super effective, as results are not checked.

There is one interesting thing:
when I check for None, it can fail positive tests (test3 or test2), which is kind of unexpected.

tlasica · 2021-01-15T16:35:21Z

ccmlib/node.py

@@ -639,7 +653,14 @@ def wait_for_binary_interface(self, **kwargs):
        Emits a warning if not listening after NODE_WAIT_TIMEOUT_IN_SECS seconds.
        """
        if self.cluster.version() >= '1.2':
-            self.watch_log_for("Starting listening for CQL clients", **kwargs)
+            start_time = time.time()


TODO: remove logging

tlasica · 2021-01-15T16:43:04Z

ccmlib/node.py

@@ -1857,55 +1878,57 @@ def _update_topology_file(self):
        with open(topology_file, 'w') as f:
            f.write(content)

+    def _is_pid_running(self):


to extract is_pid_running()
and to make status updating logic simpler

tlasica · 2021-01-15T17:15:42Z

ccmlib/node.py

+                        common.warning(msg)
+                        raise NodeError(msg)
+
+                # this can preliminary end checking (if child process is terminated)


IIUC: if the child process terminated => return None
but this None is not checked anywhere...
and when I added a check, some normal tests, like start, simply started to fail...

michaelsembwever

Changes LGTM. (thanks for the clean-up and the cleaner code alternatives).

I'm keen to see a full ci-cassandra run against this, when ready.
And what the heck does update_pid do…?

tlasica · 2021-01-15T20:17:55Z

@michaelsembwever thanks!

the fix is incomplete or simply incorrect... I have run full cassandra-dtest suite and there are some problems, for example with tests that by design want to crash nodes on CRC problems. so it will have to be more subtle.

tlasica · 2021-01-25T16:00:41Z

ccmlib/node.py

@@ -630,17 +659,30 @@ def watch_log_for_alive(self, nodes, from_mark=None, timeout=120, filename='syst
        tofind = ["%s.* now UP" % node.address_for_version(self.get_cassandra_version()) for node in tofind]
        self.watch_log_for(tofind, from_mark=from_mark, timeout=timeout, filename=filename)

+    def raise_node_error_if_cassandra_process_is_terminated(self):
+        up = self._is_pid_running()
+        common.debug("checking if C* process with pid {pid} is running: {up}".format(pid=self.pid, up=up))


to be removed after testing

tlasica · 2021-01-26T17:55:31Z

@michaelsembwever This one is ready for re-review.

Would be great to have 2nd pair of eyes.

tlasica · 2021-01-26T17:56:40Z

Running some circleCI: https://app.circleci.com/pipelines/github/tlasica/cassandra?branch=ccm-star-57

There are two tests that need to catch both TimeoutError and NodeError (as technically both can be thrown).
In fact, both could be thrown even before this change.
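
A test that expects a start failure could accept either exception along these lines. This is only a sketch: `assert_start_fails` is a hypothetical helper, `NodeError` is a local stand-in, and the built-in `TimeoutError` is used here in place of whatever timeout exception the test suite actually imports.

```python
class NodeError(Exception):
    """Hypothetical stand-in for ccm's NodeError."""


def assert_start_fails(start):
    """Call start() and require that it fails with either error type."""
    try:
        start()
    except (TimeoutError, NodeError) as exc:
        return exc  # either failure mode counts as the expected outcome
    raise AssertionError("node started although a failure was expected")
```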

tlasica · 2021-01-27T21:13:23Z

Both runs (with some test adjustments I plan to add as CASSANDRA-14605: apache/cassandra-dtest@trunk...tlasica:ccm-star-57) look pretty good to me:
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/332/ against trunk
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/330/ against cassandra-3.11

jacek-lewandowski

what about https://github.com/riptano/ccm/pull/724/files#diff-07cfe037dd1d2ac633d6a2ee26160a2dc94bc7bb8ba4efe7c329cbf2192d8dbfR565 when the process exits with 0?

jacek-lewandowski · 2021-01-28T12:26:26Z

ccmlib/cluster.py

@@ -374,11 +369,14 @@ def balanced_tokens_across_dcs(self, dcs):
        tokens.extend(new_tokens)
        return tokens

+    def _more_than_one_token_configured(self):
+        num_tokens = self._config_options.get('num_tokens', None)


maybe just int(self._config_options.get('num_tokens', '1')) > 1
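
One thing to watch with the one-liner: a default of `'1'` covers a missing key, but `int(None)` raises `TypeError` if the key is present with a `None` value. Whether ccm ever stores `None` there is an assumption on my side; a sketch that tolerates both cases (written as a free function taking the options dict, purely for illustration):

```python
def more_than_one_token_configured(config_options):
    """Return True if num_tokens is set to a value greater than 1."""
    num_tokens = config_options.get('num_tokens')
    # `or 1` covers both a missing key and an explicit None value.
    return int(num_tokens or 1) > 1
```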

jacek-lewandowski · 2021-01-28T12:35:33Z

ccmlib/node.py

@@ -519,12 +528,19 @@ def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbos
        if len(tofind) == 0:
            return None

+        def check_timeout(msg):
+            if start + timeout < time.time():
+                tstamp = time.strftime("%d %b %Y %H:%M:%S", time.gmtime())


nit: time format could be extracted to some global / static / final field?

or maybe this whole function deserves to be top level - it looks pretty generic
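
A possible shape for that extraction, as a sketch only: `TIMESTAMP_FORMAT` and the extra `start`, `timeout`, and `now` parameters are names I made up (the nested version closes over `start` and `timeout`), and the built-in `TimeoutError` stands in for the exception the real code raises.

```python
import time

# Hypothetical module-level constant for the timestamp layout.
TIMESTAMP_FORMAT = "%d %b %Y %H:%M:%S"


def check_timeout(start, timeout, msg, now=None):
    """Raise TimeoutError with a timestamped message once start + timeout has passed."""
    now = time.time() if now is None else now
    if start + timeout < now:
        tstamp = time.strftime(TIMESTAMP_FORMAT, time.gmtime(now))
        raise TimeoutError("{} {}".format(tstamp, msg))
```

Passing `now` explicitly is optional; it mainly makes the function easy to test without sleeping.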

jacek-lewandowski · 2021-01-28T12:40:27Z

ccmlib/node.py

-                        if process.returncode == 0:
-                            return None
+                    # check if abort condition is not met e.g. C* process is terminated
+                    if abort_function is not None:


why do we call it here?

OK, the logic of the abort function confused me: from the call to abort_function I concluded that we want to abort at this point. But it is more like maybe_abort... can this be more explicit? Maybe just pass a bool param (can_abort or something) and then inline that cancellation code?

I was thinking about it as well... but the problem is that watch_log_for is generic and used in tons of places.
So I wanted to have control over the condition that is used to abort...

I can try to change it again and instead of a function use a boolean: stop_if_cassandra_pid_terminated...

or just pass a condition
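
To make the "pass a condition" option concrete, here is a rough sketch. All names (`first_match`, `abort_condition`, the local `NodeError`) are hypothetical and the real method has a much wider signature; the point is only that a predicate returning True when watching should stop reads less ambiguously than a function that may or may not raise.

```python
import re


class NodeError(Exception):
    """Hypothetical stand-in for ccm's NodeError."""


def first_match(lines, pattern, abort_condition=None):
    """Return the first line matching pattern, or None if lines run out.

    abort_condition, if given, is a zero-argument predicate polled before
    each line; once it returns True the scan raises NodeError.
    """
    regex = re.compile(pattern)
    for line in lines:
        if abort_condition is not None and abort_condition():
            raise NodeError("aborted while waiting for %r" % pattern)
        if regex.search(line):
            return line
    return None
```

A caller that cares about the C* pid would pass something like `abort_condition=lambda: not node_is_running()`, where `node_is_running` is whatever liveness check is available; generic callers pass nothing and keep the old behavior.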

jacek-lewandowski · 2021-01-28T12:41:15Z

ccmlib/node.py

+                    # or if there is some race condition between log checking and start process finish
+                    # I am simply afraid to completely remove it because this is general usage method
+                    # so instead we will use it IFF the pid-based abort function is not provided
+                    if process and abort_function is None:


isn't abort_function is None redundant?

Not sure why it would be redundant?
It is indeed checked earlier, BUT that check may not throw (it is conditional).
Indeed, I need to change the naming ;-)

this is what makes this whole change so complex:
process - is a process that started C*
abort function is a function based e.g. on C* process pid (different one).

Now there are interesting races:
this process can terminate before the C* process terminates.
In such a case, instead of finding the C* process dead => raise NodeError,
this logic will return None and continue,
and then it will time out on the next step or fail silently.

jacek-lewandowski · 2021-01-28T12:51:02Z

ccmlib/node.py

-        # timeout. Other intlike types, though, we want to use.
-        if common.is_intlike(wait_for_binary_proto) and not isinstance(wait_for_binary_proto, bool):
+        # if requested wait for binary protocol to start
+        if common.is_int_not_bool(wait_for_binary_proto):


what about composing two variables

    should_wait_for_binary_proto = wait_for_binary_proto is not None
    binary_proto_timeout = wait_for_binary_proto if is_int_not_bool(wait_for_binary_proto) else None

and then just call

    if should_wait_for_binary_proto:
        self.wait_for_binary_interface(from_mark=self.mark, timeout=binary_proto_timeout)

This will not work as-is because wait_for_binary_proto can be False...
It can be fixed, but then...

I will see if it can be done.

I would rather not touch it...
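
For reference, the `False` problem and one possible fix can be sketched like this. It assumes the semantics implied by the surrounding diff (an int is used as the timeout, `True` means wait with the default timeout, `False` or `None` means do not wait); `resolve_binary_proto_wait` is a hypothetical helper and this `is_int_not_bool` is my own guess at what the one in `common` does.

```python
def is_int_not_bool(value):
    """True for ints such as 120, but not for True/False (bool subclasses int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def resolve_binary_proto_wait(wait_for_binary_proto):
    """Return (should_wait, timeout) for the given flag-or-timeout argument."""
    if is_int_not_bool(wait_for_binary_proto):
        return True, wait_for_binary_proto
    # `is not None` would wrongly treat False as "wait"; bool() does not.
    return bool(wait_for_binary_proto), None
```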

tlasica · 2021-01-28T21:01:28Z

Running new round of tests after recent changes:
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/338/ cassandra-3.11 🆗 [unrelated failures]
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/339/ trunk 🆗 [no dtest failures]
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/345/ cassandra-3.0 🆗 [unrelated failures]

https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/346/ cassandra-2.2
it has some test failures, but those look pretty much same as on
https://ci-cassandra.apache.org/job/Cassandra-2.2/41/testReport/

Before:
If, for some reason, a node does not start correctly, watch_log_for() raises TimeoutError after the specified timeout.
By default this timeout is at least 90s, but it can be specified by the caller.

After:
If a node does not start correctly, NodeError is raised immediately after the C* process terminates.

Design:
- before starting watch_log_for(), make sure that the C* process is running
- while watching logs for the "started" msg, raise NodeError if the C* process is terminated

Final note:
One of the major problems with this change is the contract for Node::start() and Cluster::start(), and the contract for Node::watch_log_for().
Officially it is: return some pair of match if found, or raise TimeoutError.
But it can also return None (unfortunately with no clear contract as to when), or raise RuntimeError.
And the callers do not use the return value at all. Sometimes even a RuntimeError is swallowed silently.
This makes changing the behavior so fragile...

michaelsembwever · 2021-01-29T13:15:46Z

merged with 89af930

tlasica closed this Jan 14, 2021

tlasica reopened this Jan 14, 2021

michaelsembwever closed this Jan 14, 2021

michaelsembwever deleted the branch master January 14, 2021 18:34

michaelsembwever reopened this Jan 14, 2021

michaelsembwever changed the base branch from cassandra-test to master January 14, 2021 21:09

tlasica force-pushed the STAR-57 branch 2 times, most recently from 8943139 to 7bed88a Compare January 15, 2021 16:27

tlasica marked this pull request as draft January 15, 2021 18:08

michaelsembwever reviewed Jan 15, 2021

tlasica force-pushed the STAR-57 branch 5 times, most recently from 7ae1648 to 7e35d1a Compare January 21, 2021 12:54

tlasica force-pushed the STAR-57 branch from 7e35d1a to 8a960d9 Compare January 21, 2021 13:05

tlasica marked this pull request as ready for review January 21, 2021 13:12

tlasica force-pushed the STAR-57 branch 3 times, most recently from 4b958c5 to 9be89b9 Compare January 25, 2021 14:40

tlasica force-pushed the STAR-57 branch 2 times, most recently from 6a78c88 to ac4359e Compare January 25, 2021 16:31

jacek-lewandowski reviewed Jan 28, 2021

tlasica force-pushed the STAR-57 branch from ab88e76 to ee5d7e7 Compare January 28, 2021 15:06

jacek-lewandowski approved these changes Jan 28, 2021

michaelsembwever approved these changes Jan 29, 2021

tlasica force-pushed the STAR-57 branch from ee5d7e7 to 3250869 Compare January 29, 2021 11:18

tlasica force-pushed the STAR-57 branch from 3250869 to b76a1dd Compare January 29, 2021 11:29

michaelsembwever closed this Jan 29, 2021

michaelsembwever deleted the STAR-57 branch March 4, 2022 20:13
