Medusa not stopping cassandra as service properly. #72

sandeepmallik · 2020-02-11T03:00:20Z

While restoring a single-node backup onto the same node, Medusa does not stop Cassandra properly. I suspect it removes the "commitlog" folder while the Cassandra shutdown (/etc/init.d/cassandra stop) is still in progress, so the shutdown is not clean. After the restore, Cassandra does not start up as a service (/etc/init.d/cassandra start), so I have to run "cassandra stop" and "cassandra start" again. I am on CentOS 6.10.

Medusa should wait until Cassandra has shut down gracefully before removing the commitlog folder.
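A minimal sketch of that ordering, assuming a caller-supplied `is_running` check (hypothetical, not part of Medusa): refuse to touch the commitlog directory while Cassandra is still up, and clear only its contents so the directory itself stays in place. The `NoSuchFileException` in the log below, raised when Cassandra tried to create a new segment during drain, suggests the directory was already gone at that point; this guard is meant to rule that out.

```python
import os
import shutil
from typing import Callable


def clear_commitlog_dir(commitlog_dir: str, is_running: Callable[[], bool]) -> int:
    """Remove everything inside commitlog_dir, but only once Cassandra is down.

    `is_running` is a hypothetical caller-supplied check (for example, one that
    looks for the Cassandra JVM process). Returns the number of entries removed.
    """
    if is_running():
        # Removing segments under a node that is still draining is what appears
        # to break the shutdown in the log below.
        raise RuntimeError("Cassandra is still running; refusing to clear " + commitlog_dir)
    removed = 0
    for name in os.listdir(commitlog_dir):
        path = os.path.join(commitlog_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
        removed += 1
    # The directory itself is kept, so a later start can create new segments in it.
    return removed
```

The design choice here is to fail loudly rather than sleep and hope: if the stop command returned early, the restore aborts before any data is touched.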

[root@localhost ~]# medusa restore-node --backup-name test4 --temp-dir /localhost/data/cassandra/tmp/ --verify --in-place
[2020-02-06 06:40:57,253] WARNING: is ccm : 0
[2020-02-06 06:40:57,283] INFO: Downloading data from backup to /localhost/data/cassandra/tmp/medusa-restore-efc46e67-f299-4e10-bc98-a3397c7fcf97
[2020-02-06 06:41:14,654] INFO: Stopping Cassandra
[2020-02-06 06:41:14,698] INFO: Moving backup data to Cassandra data directory
[2020-02-06 06:41:16,010] INFO: No --seeds specified so we will not wait for any
[2020-02-06 06:41:16,010] INFO: Starting Cassandra
[2020-02-06 06:41:16,024] INFO: Verifying the restore
[2020-02-06 06:41:16,024] INFO: Waiting for Cassandra to come up on localhost.localhost.net
[2020-02-06 06:41:17,833] INFO: Cassandra is up on localhost.localhost.net
[2020-02-06 06:41:17,834] INFO: Executing restore verify query: select * from tutorialspoint.emp;
Exception: Could not establish CQL session after 5

######## Error LOG

INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,674 HintsService.java:209 - Paused hints dispatch
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,677 Server.java:179 - Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,678 Gossiper.java:1647 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,679 StorageService.java:2442 - Node /127.0.0.1 state jump to shutdown
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:16,681 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.1] 2020-02-06 06:41:16,682 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:16,988 HintsService.java:209 - Paused hints dispatch
ERROR [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,993 StorageService.java:465 - Stopping gossiper
WARN [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,994 StorageService.java:322 - Stopping gossip by operator request
INFO [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,994 Gossiper.java:1647 - Announcing shutdown
INFO [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,995 StorageService.java:2442 - Node /127.0.0.1 state jump to shutdown
ERROR [StorageServiceShutdownHook] 2020-02-06 06:41:16,999 AbstractCommitLogSegmentManager.java:313 - Failed waiting for a forced recycle of in-use commit log segments
java.lang.AssertionError: attempted to delete non-existing file CommitLog-6-1580971059608.log
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:133) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:160) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManagerStandard.discard(CommitLogSegmentManagerStandard.java:37) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.archiveAndDiscard(AbstractCommitLogSegmentManager.java:329) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.forceRecycleAll(AbstractCommitLogSegmentManager.java:303) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLog.forceRecycleAllSegments(CommitLog.java:208) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.StorageService.drain(StorageService.java:4693) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:681) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]
at java.lang.Thread.run(Unknown Source) ~[na:1.8.0_171]
ERROR [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:18,997 CommitLog.java:464 - Failed managing commit log segments. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.nio.file.NoSuchFileException: /localhost/data/cassandra/commitlog/CommitLog-6-1580971059610.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:174) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.MemoryMappedSegment.<init>(MemoryMappedSegment.java:45) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLogSegment.createSegment(CommitLogSegment.java:131) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManagerStandard.createSegment(CommitLogSegmentManagerStandard.java:78) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager$1.runMayThrow(AbstractCommitLogSegmentManager.java:110) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]
at java.lang.Thread.run(Unknown Source) ~[na:1.8.0_171]
Caused by: java.nio.file.NoSuchFileException: /localhost/data/cassandra/commitlog/CommitLog-6-1580971059610.log
at sun.nio.fs.UnixException.translateToIOException(Unknown Source) ~[na:1.8.0_171]
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) ~[na:1.8.0_171]
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) ~[na:1.8.0_171]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(Unknown Source) ~[na:1.8.0_171]
at java.nio.channels.FileChannel.open(Unknown Source) ~[na:1.8.0_171]
at java.nio.channels.FileChannel.open(Unknown Source) ~[na:1.8.0_171]
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:169) ~[apache-cassandra-3.11.5.jar:3.11.5]
... 7 common frames omitted

adejanovski · 2020-02-11T16:24:12Z

This can be overridden in the configuration file, where you can provide your own stop/start commands.
We usually don't care about properly draining the nodes because we're going to replace all the data anyway, which is why a "dirty" stop is fine, but I agree that the service should be stopped so you don't need to stop/start it again.
We're using /etc/init.d/cassandra start because we need to enforce the tokens in some cases and pass them using a JVM flag. I guess we could work around this by modifying the cassandra-env.sh file instead, although that could be a problem for some folks. We need to give this some thought.

It's interesting that /etc/init.d/cassandra stop returns before the node is actually down. It looks like we need a few more checks to make sure it is down before proceeding with the cleanup and restore tasks.

sandeepmallik · 2020-02-12T04:48:50Z

@adejanovski Yes. Something like: fetch Cassandra's PID, stop Cassandra, and make sure the PID no longer exists.

root@ubuntu:# CassPid=$(pgrep -f cassandra)
root@ubuntu:# echo $CassPid
1376
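A sketch of the PID approach in Python, with two assumptions worth stating. First, `pgrep -f cassandra` matches any process whose command line mentions "cassandra", which would include the medusa command above (its `--temp-dir` is under /localhost/data/cassandra), so this sketch matches the Cassandra main class `org.apache.cassandra.service.CassandraDaemon` instead. Second, it reads `/proc`, so it is Linux-only. The helper names `find_pids` and `pid_alive` are hypothetical.

```python
import os
from typing import List

# Main class of the Cassandra JVM; narrower than `pgrep -f cassandra`.
CASSANDRA_MAIN_CLASS = "org.apache.cassandra.service.CassandraDaemon"


def find_pids(pattern: str = CASSANDRA_MAIN_CLASS) -> List[int]:
    """Return PIDs whose command line contains `pattern` (Linux /proc only)."""
    pids: List[int] = []
    if not os.path.isdir("/proc"):
        return pids
    own_pid = os.getpid()
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == own_pid:
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                # Arguments are NUL-separated; join them with spaces for matching.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited meanwhile or is not readable
        if pattern in cmdline:
            pids.append(int(entry))
    return pids


def pid_alive(pid: int) -> bool:
    """Signal 0 runs the existence/permission check without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Note that a zombie still counts as alive for `os.kill(pid, 0)`; a Cassandra started from the init script is reaped by init, so that should not matter here.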

sandeepmallik · 2020-02-13T05:16:50Z

@adejanovski I added a sleep after stopping Cassandra to work around it. This may not be the right solution, but it works for me.

restore_node.py
import logging
import time

logging.info('Waiting for cassandra to stop. Sleeping for 10 seconds')
time.sleep(10)
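A fixed 10-second sleep is either too long or, on a slow drain, too short. A sketch of a bounded poll that could take its place, assuming the Cassandra PIDs were captured before the stop command ran (for example with the `pgrep` call shown earlier). `wait_for_exit` and `_alive` are hypothetical names, not Medusa functions.

```python
import logging
import os
import time
from typing import Iterable


def _alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, owned by another user
    return True


def wait_for_exit(pids: Iterable[int], timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll until every PID is gone; return False if `timeout` seconds pass first."""
    remaining = set(pids)
    deadline = time.monotonic() + timeout
    while True:
        remaining = {pid for pid in remaining if _alive(pid)}
        if not remaining:
            return True
        if time.monotonic() >= deadline:
            logging.warning("Cassandra still running after %.0fs: %s", timeout, sorted(remaining))
            return False
        logging.info("Waiting for Cassandra to stop (pids %s)", sorted(remaining))
        time.sleep(interval)
```

On a `False` return the restore would abort instead of moving data under a live node; the 120-second default is an arbitrary placeholder.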

Also, the CQL verification step is not working properly. This is likely because Medusa does not wait for the CQL native transport port (9042) to become available before running the select query. After I added sleep(20) after Cassandra starts, verification worked.

[2020-02-13 04:30:28,983] INFO: Executing restore verify query: select * from tutorialspoint.emp;
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/medusa/cassandra_utils.py", line 84, in new_session
    raise Exception('Could not establish CQL session after {attempts}'.format(attempts=attempts))
Exception: Could not establish CQL session after 5
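Instead of a fixed sleep(20), the verify step could first poll until the native transport port accepts TCP connections. A minimal sketch, assuming port 9042 as in this report; `wait_for_port` is a hypothetical helper, not part of Medusa. An open port only shows that Cassandra is listening, so a retry around the verify query itself is still worthwhile. The reverse does not hold either: the shutdown log shows "Stop listening for CQL clients" about two seconds before the commit log errors, so a closed port is not proof that the node has stopped.

```python
import socket
import time


def wait_for_port(host: str, port: int = 9042, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed immediately; we only care that it succeeds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```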


arodrime added the bug (Something isn't working) and LHF (Low Hanging Fruit) labels Apr 28, 2020

adejanovski added this to Backlog in TLP OSS via automation May 6, 2020

adejanovski moved this from Backlog to To do in TLP OSS May 11, 2020

adejanovski assigned rzvoncek May 11, 2020

rzvoncek pushed a commit that referenced this issue May 21, 2020

Add wait for node shutdown. Fix healtcheck config. Fixes #72

b782aa5

rzvoncek mentioned this issue May 21, 2020

Add wait for node shutdown. Fix healtcheck config. Fixes #72 #138

Merged

rzvoncek moved this from To do to In progress in TLP OSS May 21, 2020

rzvoncek pushed a commit that referenced this issue May 21, 2020

Add wait for node shutdown. Fix healtcheck config. Fixes #72

64eceae

rzvoncek pushed a commit that referenced this issue May 21, 2020

Add wait for node shutdown. Fix healtcheck config. Fixes #72

4046b0f

rzvoncek pushed a commit that referenced this issue Jun 9, 2020

Add wait for node shutdown. Fix healtcheck config. Fixes #72

58cf984

rzvoncek pushed a commit that referenced this issue Jun 9, 2020

Add wait for node shutdown. Fix healtcheck config. Fixes #72

6df83d3

rzvoncek closed this as completed in f3e2415 Jun 15, 2020

TLP OSS automation moved this from In progress to Done Jun 15, 2020

WentingWu666666 pushed a commit to WentingWu666666/cassandra-medusa that referenced this issue Oct 12, 2022

Add wait for node shutdown. Fix healtcheck config. Fixes thelastpickl…

894c763

…e#72
