Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STAR-57. Check if pid is alive while waiting for logs #724

Closed
wants to merge 1 commit into from

Conversation

tlasica
Copy link
Contributor

@tlasica tlasica commented Jan 14, 2021

Rationale:
When starting C* ccm checks logs, waiting for particular phrases.
We want to fail test asap if there is a problem with node start.

Before:
If there is any problem with node start it will not happen and start() will hit timeout, by default 90s.
So the test will fail after 90s while it actually fail after 5s.

After:
If logs are streamed ccm will check if node pid is still alive.
If not it will immediately raise NodeError to fail test telling that node is not started correctly.

Note:
this may break some tests, that are testing node start failure by expecting TimeoutError.
Such tests should be adjusted, but there should be only a few.

8ba2709 is about some refactoring
8a960d9 is the change

@tlasica
Copy link
Contributor Author

tlasica commented Jan 14, 2021

Kind of interesting that build fails on python 3.7.7

ccmlib/node.py Outdated
pid_before_check = self.pid
self.__update_status()
if not self.is_running():
raise RuntimeError("node {n} with pid {pid} is not running".format(n=self.name, pid=pid_before_check))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it can cause problems... in the node.stop()
it should not but there is a possible race between reading log and the moment pid disappears

@tlasica tlasica closed this Jan 14, 2021
@tlasica tlasica reopened this Jan 14, 2021
@tlasica
Copy link
Contributor Author

tlasica commented Jan 14, 2021

retest this please

@tlasica
Copy link
Contributor Author

tlasica commented Jan 14, 2021

hi, may I ask why did you close this PR @michaelsembwever ?

@michaelsembwever
Copy link
Collaborator

That was unintentional.

It looks like when I deleted the cassandra-test branch all PRs pointing to it got automatically closed.

ref: https://issues.apache.org/jira/browse/CASSANDRA-16383

@michaelsembwever michaelsembwever changed the base branch from cassandra-test to master January 14, 2021 21:09
@tlasica
Copy link
Contributor Author

tlasica commented Jan 15, 2021

@michaelsembwever I see, so I can just rebase branch to master ...

@michaelsembwever
Copy link
Collaborator

I can just rebase branch to master ...

Yes :)

@tlasica tlasica force-pushed the STAR-57 branch 2 times, most recently from 8943139 to 7bed88a Compare January 15, 2021 16:27
@@ -518,8 +516,7 @@ def start(self, no_wait=False, verbose=False, wait_for_binary_proto=True,

# if the node is going to allocate_strategy_ tokens during start, then wait_for_binary_proto=True
node_wait_for_binary_proto = (self.can_generate_tokens() and self.use_vnodes and node.initial_token is None)

p = node.start(update_pid=False, jvm_args=jvm_args, jvm_version=jvm_version,
p = node.start(update_pid=True, jvm_args=jvm_args, jvm_version=jvm_version,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tricky change... I am not 100% it is correct
I wonder why we could start with update_pid=False?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷

ccmlib/node.py Outdated
@@ -558,21 +558,35 @@ def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbos
matchings.append((line, m))
tofind.remove(e)
if len(tofind) == 0:
common.debug("return")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TO BE REMOVED

ccmlib/node.py Outdated
common.warning(msg)
raise NodeError(msg)

# process is in this case process that started C*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another tricky part....

Possibly the intention of this part of the code was to actually finish immediately if child process terminated...
return None as not supper effective as results are not checked.

There is one interesting thing:
when checking None it can fail positive tests: test3 or test2 which is kind of unexpected.

ccmlib/node.py Outdated
@@ -639,7 +653,14 @@ def wait_for_binary_interface(self, **kwargs):
Emits a warning if not listening after NODE_WAIT_TIMEOUT_IN_SECS seconds.
"""
if self.cluster.version() >= '1.2':
self.watch_log_for("Starting listening for CQL clients", **kwargs)
start_time = time.time()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: remove logging

@@ -1857,55 +1878,57 @@ def _update_topology_file(self):
with open(topology_file, 'w') as f:
f.write(content)

def _is_pid_running(self):
Copy link
Contributor Author

@tlasica tlasica Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to extract is_pid_running()
and to make status updating logic simpler

ccmlib/node.py Outdated
common.warning(msg)
raise NodeError(msg)

# this can preliminary end checking (if child process is terminated)
Copy link
Contributor Author

@tlasica tlasica Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC: if child process terminated => return None
but this None is not checked anywhere....
and when I added check some normal tests like start simply started to fail...

@tlasica tlasica marked this pull request as draft January 15, 2021 18:08
Copy link
Collaborator

@michaelsembwever michaelsembwever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. (thanks for the clean-up and the cleaner code alternatives).

I'm keen to see a full ci-cassandra run against this, when ready.
And what the heck update_pid does…?

@tlasica
Copy link
Contributor Author

tlasica commented Jan 15, 2021

@michaelsembwever thanks!

the fix is incomplete or simply incorrect... I have run full cassandra-dtest suite and there are some problems, for example with tests that by design want to crash nodes on CRC problems. so it will have to be more subtle.

@tlasica tlasica force-pushed the STAR-57 branch 5 times, most recently from 7ae1648 to 7e35d1a Compare January 21, 2021 12:54
@tlasica tlasica marked this pull request as ready for review January 21, 2021 13:12
@tlasica tlasica force-pushed the STAR-57 branch 3 times, most recently from 4b958c5 to 9be89b9 Compare January 25, 2021 14:40
ccmlib/node.py Outdated
@@ -630,17 +659,30 @@ def watch_log_for_alive(self, nodes, from_mark=None, timeout=120, filename='syst
tofind = ["%s.* now UP" % node.address_for_version(self.get_cassandra_version()) for node in tofind]
self.watch_log_for(tofind, from_mark=from_mark, timeout=timeout, filename=filename)

def raise_node_error_if_cassandra_process_is_terminated(self):
up = self._is_pid_running()
common.debug("checking if C* process with pid {pid} is running: {up}".format(pid=self.pid, up=up))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be removed after testing

@tlasica tlasica force-pushed the STAR-57 branch 2 times, most recently from 6a78c88 to ac4359e Compare January 25, 2021 16:31
@tlasica
Copy link
Contributor Author

tlasica commented Jan 26, 2021

@michaelsembwever This one is ready for re-review.

Would be great to have 2nd pair of eyes.

@tlasica
Copy link
Contributor Author

tlasica commented Jan 26, 2021

Running some circleCI: https://app.circleci.com/pipelines/github/tlasica/cassandra?branch=ccm-star-57

There are two tests that need to catch both TimeoutError and NodeError (as technically both can be thrown).
In fact even before that change both can be thrown.

@tlasica
Copy link
Contributor Author

tlasica commented Jan 27, 2021

both runs [some test adjustments I plan to add as CASSANDRA-14605] (apache/cassandra-dtest@trunk...tlasica:ccm-star-57) look pretty good to me
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/332/ against trunk
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/330/ against cassandra-3.11

Copy link
Contributor

@jacek-lewandowski jacek-lewandowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -374,11 +369,14 @@ def balanced_tokens_across_dcs(self, dcs):
tokens.extend(new_tokens)
return tokens

def _more_than_one_token_configured(self):
num_tokens = self._config_options.get('num_tokens', None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just int(self._config_options.get('num_tokens', '1')) > 1

ccmlib/cluster.py Show resolved Hide resolved
ccmlib/node.py Outdated
@@ -519,12 +528,19 @@ def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbos
if len(tofind) == 0:
return None

def check_timeout(msg):
if start + timeout < time.time():
tstamp = time.strftime("%d %b %Y %H:%M:%S", time.gmtime())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: time format could be extracted to some global / static / final field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe this whole function deserves to be top level - it looks pretty generic

ccmlib/node.py Outdated
if process.returncode == 0:
return None
# check if abort condition is not met e.g. C* process is terminated
if abort_function is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we call it here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, the logic of abort function mixed me up - by calling abort_function I conclude that we want to abort at this point. But it is more like maybe_abort... can be this more explicit? Maybe just pass a bool param - can_abort or something and then inline that cancellation code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it as well... but the problem is this watch_for_logs is generic, used in tons of places.
So I wanted to have control about the condition that is used to abort...

I can try to change it again and instead of a function use a boolean: stop_if_cassandra_pid_terminated...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or just pass a condition

ccmlib/node.py Outdated
# or if there is some race condition between log checking and start process finish
# I am simply afraid to completely remove it because this is general usage method
# so instead we will use it IFF the pid-based abort function is not provided
if process and abort_function is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't abort_function is None redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why redundant?
It is indeed checked earlier BUT it cannot not throw (it is conditional).
Indeed, I need to change naming ;-)

this is what makes this whole change so complex:
process - is a process that started C*
abort function is a function based e.g. on C* process pid (different one).

Now there are interesting races:
this process can be terminated before C* process is terminated.
in such case instead of finding C* process dead => raise NodeError
this logic will return None and continue
and then it will timeout on the next step or fail silently

# timeout. Other intlike types, though, we want to use.
if common.is_intlike(wait_for_binary_proto) and not isinstance(wait_for_binary_proto, bool):
# if requested wait for binary protocol to start
if common.is_int_not_bool(wait_for_binary_proto):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about composing two variables

should_wait_for_binary_proto = wait_for_binary_proto is not None
binary_proto_timeout = wait_for_binary_proto if is_int_not_bool(wait_for_binary_proto) else None

and then just call

if should_wait_for_binary_proto: 
    self.wait_for_binary_interface(from_mark=self.mark, timeout=binary_proto_timeout)

Copy link
Contributor Author

@tlasica tlasica Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not work straight because wait_for_binary_proto can be False...
it can be fixed, but then...

I will see if it can be done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not touch it...

@tlasica
Copy link
Contributor Author

tlasica commented Jan 28, 2021

Running new round of tests after recent changes:
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/338/ cassandra-3.11 🆗 [unrelated failures]
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/339/ trunk 🆗 [no dtest failures]
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/345/ cassandra-3.0 🆗 [unrelated failures]

https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/346/ cassandra-2.2
it has some test failures, but those look pretty much same as on
https://ci-cassandra.apache.org/job/Cassandra-2.2/41/testReport/

Before:
If - for some reason - node will not start correctly, then watch_for_log would raise TimeoutError after specified timeout.
By default this timeout is at least 90s, but can be specified by caller.

After:
If node will not start correctly then it will raise NodeError immediately after C* process terminates.

Design:
- before starting watch_for_log() make sure that C* process is running
- during watching logs for "started" msg raise NodeError if C* process is terminated

Final note:
One of the major problems with this change is the contract for Node::start() and Cluster::start().

And contract for Node::watch_for_logs()
Officialy it is: return some pair of match if found or raise TimeoutError.
But it can also return None (unfortunately no clear contract when).
Or raise RuntimeError.

And the callers do not use return value at all.
Sometimes even a RuntimeError is swallowed silently.

This makes changing the behavior so fragile...
@michaelsembwever
Copy link
Collaborator

merged with 89af930

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants