New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STAR-57. Check if pid is alive while waiting for logs #724
Conversation
Kind of interesting that build fails on python 3.7.7 |
ccmlib/node.py
Outdated
pid_before_check = self.pid | ||
self.__update_status() | ||
if not self.is_running(): | ||
raise RuntimeError("node {n} with pid {pid} is not running".format(n=self.name, pid=pid_before_check)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it can cause problems... in the node.stop()
it should not but there is a possible race between reading log and the moment pid disappears
retest this please |
hi, may I ask why did you close this PR @michaelsembwever ? |
That was unintentional. It looks like when I deleted the |
@michaelsembwever I see, so I can just rebase branch to master ... |
Yes :) |
8943139
to
7bed88a
Compare
ccmlib/cluster.py
Outdated
@@ -518,8 +516,7 @@ def start(self, no_wait=False, verbose=False, wait_for_binary_proto=True, | |||
|
|||
# if the node is going to allocate_strategy_ tokens during start, then wait_for_binary_proto=True | |||
node_wait_for_binary_proto = (self.can_generate_tokens() and self.use_vnodes and node.initial_token is None) | |||
|
|||
p = node.start(update_pid=False, jvm_args=jvm_args, jvm_version=jvm_version, | |||
p = node.start(update_pid=True, jvm_args=jvm_args, jvm_version=jvm_version, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is tricky change... I am not 100% it is correct
I wonder why we could start with update_pid=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷
ccmlib/node.py
Outdated
@@ -558,21 +558,35 @@ def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbos | |||
matchings.append((line, m)) | |||
tofind.remove(e) | |||
if len(tofind) == 0: | |||
common.debug("return") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TO BE REMOVED
ccmlib/node.py
Outdated
common.warning(msg) | ||
raise NodeError(msg) | ||
|
||
# process is in this case process that started C* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another tricky part....
Possibly the intention of this part of the code was to actually finish immediately if child process terminated...
return None
as not supper effective as results are not checked.
There is one interesting thing:
when checking None
it can fail positive tests: test3
or test2
which is kind of unexpected.
ccmlib/node.py
Outdated
@@ -639,7 +653,14 @@ def wait_for_binary_interface(self, **kwargs): | |||
Emits a warning if not listening after NODE_WAIT_TIMEOUT_IN_SECS seconds. | |||
""" | |||
if self.cluster.version() >= '1.2': | |||
self.watch_log_for("Starting listening for CQL clients", **kwargs) | |||
start_time = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: remove logging
@@ -1857,55 +1878,57 @@ def _update_topology_file(self): | |||
with open(topology_file, 'w') as f: | |||
f.write(content) | |||
|
|||
def _is_pid_running(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to extract is_pid_running()
and to make status updating logic simpler
ccmlib/node.py
Outdated
common.warning(msg) | ||
raise NodeError(msg) | ||
|
||
# this can preliminary end checking (if child process is terminated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC: if child process terminated => return None
but this None is not checked anywhere....
and when I added check some normal tests like start simply started to fail...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM. (thanks for the clean-up and the cleaner code alternatives).
I'm keen to see a full ci-cassandra run against this, when ready.
And what the heck update_pid
does…?
@michaelsembwever thanks! the fix is incomplete or simply incorrect... I have run full cassandra-dtest suite and there are some problems, for example with tests that by design want to crash nodes on CRC problems. so it will have to be more subtle. |
7ae1648
to
7e35d1a
Compare
4b958c5
to
9be89b9
Compare
ccmlib/node.py
Outdated
@@ -630,17 +659,30 @@ def watch_log_for_alive(self, nodes, from_mark=None, timeout=120, filename='syst | |||
tofind = ["%s.* now UP" % node.address_for_version(self.get_cassandra_version()) for node in tofind] | |||
self.watch_log_for(tofind, from_mark=from_mark, timeout=timeout, filename=filename) | |||
|
|||
def raise_node_error_if_cassandra_process_is_terminated(self): | |||
up = self._is_pid_running() | |||
common.debug("checking if C* process with pid {pid} is running: {up}".format(pid=self.pid, up=up)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be removed after testing
6a78c88
to
ac4359e
Compare
@michaelsembwever This one is ready for re-review. Would be great to have 2nd pair of eyes. |
Running some circleCI: https://app.circleci.com/pipelines/github/tlasica/cassandra?branch=ccm-star-57 There are two tests that need to catch both TimeoutError and NodeError (as technically both can be thrown). |
both runs [some test adjustments I plan to add as CASSANDRA-14605] (apache/cassandra-dtest@trunk...tlasica:ccm-star-57) look pretty good to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about https://github.com/riptano/ccm/pull/724/files#diff-07cfe037dd1d2ac633d6a2ee26160a2dc94bc7bb8ba4efe7c329cbf2192d8dbfR565 when the process exits with 0?
ccmlib/cluster.py
Outdated
@@ -374,11 +369,14 @@ def balanced_tokens_across_dcs(self, dcs): | |||
tokens.extend(new_tokens) | |||
return tokens | |||
|
|||
def _more_than_one_token_configured(self): | |||
num_tokens = self._config_options.get('num_tokens', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just int(self._config_options.get('num_tokens', '1')) > 1
ccmlib/node.py
Outdated
@@ -519,12 +528,19 @@ def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbos | |||
if len(tofind) == 0: | |||
return None | |||
|
|||
def check_timeout(msg): | |||
if start + timeout < time.time(): | |||
tstamp = time.strftime("%d %b %Y %H:%M:%S", time.gmtime()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: time format could be extracted to some global / static / final field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe this whole function deserves to be top level - it looks pretty generic
ccmlib/node.py
Outdated
if process.returncode == 0: | ||
return None | ||
# check if abort condition is not met e.g. C* process is terminated | ||
if abort_function is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we call it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, the logic of abort function mixed me up - by calling abort_function
I conclude that we want to abort at this point. But it is more like maybe_abort... can be this more explicit? Maybe just pass a bool param - can_abort or something and then inline that cancellation code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it as well... but the problem is this watch_for_logs
is generic, used in tons of places.
So I wanted to have control about the condition that is used to abort...
I can try to change it again and instead of a function use a boolean: stop_if_cassandra_pid_terminated
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or just pass a condition
ccmlib/node.py
Outdated
# or if there is some race condition between log checking and start process finish | ||
# I am simply afraid to completely remove it because this is general usage method | ||
# so instead we will use it IFF the pid-based abort function is not provided | ||
if process and abort_function is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't abort_function is None
redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why redundant?
It is indeed checked earlier BUT it cannot not throw (it is conditional).
Indeed, I need to change naming ;-)
this is what makes this whole change so complex:
process - is a process that started C*
abort function is a function based e.g. on C* process pid (different one).
Now there are interesting races:
this process can be terminated before C* process is terminated.
in such case instead of finding C* process dead => raise NodeError
this logic will return None and continue
and then it will timeout on the next step or fail silently
# timeout. Other intlike types, though, we want to use. | ||
if common.is_intlike(wait_for_binary_proto) and not isinstance(wait_for_binary_proto, bool): | ||
# if requested wait for binary protocol to start | ||
if common.is_int_not_bool(wait_for_binary_proto): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about composing two variables
should_wait_for_binary_proto = wait_for_binary_proto is not None
binary_proto_timeout = wait_for_binary_proto if is_int_not_bool(wait_for_binary_proto) else None
and then just call
if should_wait_for_binary_proto:
self.wait_for_binary_interface(from_mark=self.mark, timeout=binary_proto_timeout)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will not work straight because wait_for_binary_proto
can be False
...
it can be fixed, but then...
I will see if it can be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not touch it...
Running new round of tests after recent changes: https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/346/ cassandra-2.2 |
Before: If - for some reason - node will not start correctly, then watch_for_log would raise TimeoutError after specified timeout. By default this timeout is at least 90s, but can be specified by caller. After: If node will not start correctly then it will raise NodeError immediately after C* process terminates. Design: - before starting watch_for_log() make sure that C* process is running - during watching logs for "started" msg raise NodeError if C* process is terminated Final note: One of the major problems with this change is the contract for Node::start() and Cluster::start(). And contract for Node::watch_for_logs() Officialy it is: return some pair of match if found or raise TimeoutError. But it can also return None (unfortunately no clear contract when). Or raise RuntimeError. And the callers do not use return value at all. Sometimes even a RuntimeError is swallowed silently. This makes changing the behavior so fragile...
merged with 89af930 |
Rationale:
When starting C* ccm checks logs, waiting for particular phrases.
We want to fail test asap if there is a problem with node start.
Before:
If there is any problem with node start it will not happen and start() will hit timeout, by default 90s.
So the test will fail after 90s while it actually fail after 5s.
After:
If logs are streamed ccm will check if node pid is still alive.
If not it will immediately raise
NodeError
to fail test telling that node is not started correctly.Note:
this may break some tests, that are testing node start failure by expecting
TimeoutError
.Such tests should be adjusted, but there should be only a few.
8ba2709 is about some refactoring
8a960d9 is the change