Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Medusa not stopping cassandra as service properly. #72

Closed
sandeepmallik opened this issue Feb 11, 2020 · 3 comments
Closed

Medusa not stopping cassandra as service properly. #72

sandeepmallik opened this issue Feb 11, 2020 · 3 comments
Assignees
Labels
bug Something isn't working LHF Low Hanging Fruit
Projects

Comments

@sandeepmallik
Copy link

While I am restoring a single node backup on same node, medusa is not stopping cassandra properly. I doubt it is removing "commitlogs" folder while cassandra shutdown (/etc/init.d/cassandra stop) is happening. So, shutdown is not clean. After restore, cassandra is not starting up as service (/etc/init.d/cassandra start). So I have to run "cassandra stop" and "cassandra start" again. I am on CentOS 6.10.

Medusa has to wait till cassandra is shutdown gracefully before removing commitlogs folder.

[root@localhost ~]# medusa restore-node --backup-name test4 --temp-dir /localhost/data/cassandra/tmp/ --verify --in-place
[2020-02-06 06:40:57,253] WARNING: is ccm : 0
[2020-02-06 06:40:57,283] INFO: Downloading data from backup to /localhost/data/cassandra/tmp/medusa-restore-efc46e67-f299-4e10-bc98-a3397c7fcf97
[2020-02-06 06:41:14,654] INFO: Stopping Cassandra
[2020-02-06 06:41:14,698] INFO: Moving backup data to Cassandra data directory
[2020-02-06 06:41:16,010] INFO: No --seeds specified so we will not wait for any
[2020-02-06 06:41:16,010] INFO: Starting Cassandra
[2020-02-06 06:41:16,024] INFO: Verifying the restore
[2020-02-06 06:41:16,024] INFO: Waiting for Cassandra to come up on localhost.localhost.net
[2020-02-06 06:41:17,833] INFO: Cassandra is up on localhost.localhost.net
[2020-02-06 06:41:17,834] INFO: Executing restore verify query: select * from tutorialspoint.emp;
Exception: Could not establish CQL session after 5

######## Error LOG

INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,674 HintsService.java:209 - Paused hints dispatch
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,677 Server.java:179 - Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,678 Gossiper.java:1647 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:14,679 StorageService.java:2442 - Node /127.0.0.1 state jump to shutdown
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:16,681 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.1] 2020-02-06 06:41:16,682 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO [StorageServiceShutdownHook] 2020-02-06 06:41:16,988 HintsService.java:209 - Paused hints dispatch
ERROR [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,993 StorageService.java:465 - Stopping gossiper
WARN [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,994 StorageService.java:322 - Stopping gossip by operator request
INFO [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,994 Gossiper.java:1647 - Announcing shutdown
INFO [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:16,995 StorageService.java:2442 - Node /127.0.0.1 state jump to shutdown
ERROR [StorageServiceShutdownHook] 2020-02-06 06:41:16,999 AbstractCommitLogSegmentManager.java:313 - Failed waiting for a forced recycle of in-use commit log segments
java.lang.AssertionError: attempted to delete non-existing file CommitLog-6-1580971059608.log
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:133) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:160) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManagerStandard.discard(CommitLogSegmentManagerStandard.java:37) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.archiveAndDiscard(AbstractCommitLogSegmentManager.java:329) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.forceRecycleAll(AbstractCommitLogSegmentManager.java:303) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLog.forceRecycleAllSegments(CommitLog.java:208) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.StorageService.drain(StorageService.java:4693) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:681) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]
at java.lang.Thread.run(Unknown Source) ~[na:1.8.0_171]
ERROR [COMMIT-LOG-ALLOCATOR] 2020-02-06 06:41:18,997 CommitLog.java:464 - Failed managing commit log segments. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.nio.file.NoSuchFileException: /localhost/data/cassandra/commitlog/CommitLog-6-1580971059610.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.(CommitLogSegment.java:174) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.MemoryMappedSegment.(MemoryMappedSegment.java:45) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLogSegment.createSegment(CommitLogSegment.java:131) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.CommitLogSegmentManagerStandard.createSegment(CommitLogSegmentManagerStandard.java:78) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager$1.runMayThrow(AbstractCommitLogSegmentManager.java:110) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]
at java.lang.Thread.run(Unknown Source) ~[na:1.8.0_171]
Caused by: java.nio.file.NoSuchFileException: /localhost/data/cassandra/commitlog/CommitLog-6-1580971059610.log
at sun.nio.fs.UnixException.translateToIOException(Unknown Source) ~[na:1.8.0_171]
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) ~[na:1.8.0_171]
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) ~[na:1.8.0_171]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(Unknown Source) ~[na:1.8.0_171]
at java.nio.channels.FileChannel.open(Unknown Source) ~[na:1.8.0_171]
at java.nio.channels.FileChannel.open(Unknown Source) ~[na:1.8.0_171]
at org.apache.cassandra.db.commitlog.CommitLogSegment.(CommitLogSegment.java:169) ~[apache-cassandra-3.11.5.jar:3.11.5]
... 7 common frames omitted

@adejanovski
Copy link
Contributor

This can be overriden in the configuration file, where you can provide your own stop/start commands.
We usually don't care about properly draining the nodes because we're going to replace all the data anyway, which is why a "dirty" stop is fine, but I agree that the service should be stopped so you don't need to stop/start it again.
We're using /etc/init.d/cassandra start because we need to enforce the tokens in some cases, and pass them using a JVM flag. I guess we could workaround this by modifying the cassandra-env.sh file instead, although that could be a problem to some folks. We need to give this some thoughts.

It's interesting that /etc/init.d/cassandra stop returns before the node is actually down. Looks like we need a bit more checks to make sure it is down before proceeding with the cleanup and restore tasks.

@sandeepmallik
Copy link
Author

sandeepmallik commented Feb 12, 2020

@adejanovski Yes. Something like fetch PID of cassandra, stop cassandra and make sure PID doesn't exist.

root@ubuntu:# CassPid=$(pgrep -f cassandra)
root@ubuntu:# echo $CassPid
1376

@sandeepmallik
Copy link
Author

@adejanovski After stopping cassandra, added sleep to fix it. This may not be right solution but works for me.

restore_node.py
logging.info('Waiting for cassandra to stop. Sleeping for 10 seconds')
time.sleep(10)

Also, verify with CQL is not working properly. It should be due to medusa is not waiting for CQL native transport port (9042) to be available before running select query. I added sleep(20) after cassandra starts then verify worked.

[2020-02-13 04:30:28,983] INFO: Executing restore verify query: select * from tutorialspoint.emp;
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/medusa/cassandra_utils.py", line 84, in new_session raise Exception('Could not establish CQL session after {attempts}'.format(attempts=attempts)) Exception: Could not establish CQL session after 5

@arodrime arodrime added bug Something isn't working LHF Low Hanging Fruit labels Apr 28, 2020
@adejanovski adejanovski added this to Backlog in TLP OSS via automation May 6, 2020
@adejanovski adejanovski moved this from Backlog to To do in TLP OSS May 11, 2020
@rzvoncek rzvoncek moved this from To do to In progress in TLP OSS May 21, 2020
TLP OSS automation moved this from In progress to Done Jun 15, 2020
WentingWu666666 pushed a commit to WentingWu666666/cassandra-medusa that referenced this issue Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working LHF Low Hanging Fruit
Projects
TLP OSS
  
Done
Development

No branches or pull requests

4 participants