Restore with sstableloader does not use fqdn present in configuration #148

Open
giom-l opened this issue May 28, 2020 · 0 comments · May be fixed by #149
Labels
blocked (Issues in the state 'blocked') · bug (Something isn't working)

Comments

giom-l commented May 28, 2020

When restoring a cluster with sstableloader, there seems to be an issue with how the fqdn is handled.

Our cluster is EC2-based, and each node has two network interfaces:

  • one for the internal node (provided by default by AWS)
  • one we attach and use for cluster communication, so that if a node is terminated, its replacement is attached to the same network interface; the fqdn in each node's Medusa configuration refers to this interface (see the sketch after this list)
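
For reference, here is a minimal sketch of reading back the fqdn we are talking about from a node's Medusa configuration. The file path and the `fqdn` option living in the `[storage]` section are assumptions about a typical medusa.ini layout, not copied from this ticket:

```python
import configparser

# Assumption: medusa.ini lives at /etc/medusa/medusa.ini and carries the
# fqdn option in its [storage] section, as in a typical Medusa setup.
parser = configparser.ConfigParser()
parser.read('/etc/medusa/medusa.ini')
configured_fqdn = parser.get('storage', 'fqdn')
# On the seed node in this example that would be
# ip-172-29-180-216.eu-west-1.compute.internal, i.e. the second,
# cluster-facing interface, not the default AWS-provided one.
print(configured_fqdn)
```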

When running a restore using sstableloader via this command:

nohup medusa restore-cluster --backup-name=dev-14-05-2020-full --table=whatever.customer_190416_bob --seed-target=ip-172-29-180-216.eu-west-1.compute.internal

We get the following output:

[2020-05-28 09:54:49,508] INFO: Monitoring provider is noop
[2020-05-28 09:54:49,508] INFO: system_auth keyspace will be overwritten with the backup on target nodes
[2020-05-28 09:54:50,090] INFO: Ensuring the backup is found and is complete
[2020-05-28 09:54:50,124] INFO: Restore will happen "In-Place", no new hardware is involved
[2020-05-28 09:54:50,154] INFO: Tokenmap is differently distributed. Extra items: {'[truncated list of tokens]'}
[2020-05-28 09:54:50,156] INFO: Starting cluster restore...
[2020-05-28 09:54:50,156] INFO: Working directory for this execution: /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-180-216.eu-west-1.compute.internal using {'source': ['ip-172-29-181-241.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-182-200.eu-west-1.compute.internal using {'source': ['ip-172-29-181-163.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-182-219.eu-west-1.compute.internal using {'source': ['ip-172-29-180-194.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-181-207.eu-west-1.compute.internal using {'source': ['ip-172-29-180-186.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-180-190.eu-west-1.compute.internal using {'source': ['ip-172-29-182-156.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,157] INFO: About to restore on ip-172-29-181-208.eu-west-1.compute.internal using {'source': ['ip-172-29-182-138.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,157] INFO: This will delete all data on the target nodes and replace it with backup dev-14-05-2020-full.
[2020-05-28 09:54:50,157] INFO: target seeds : []
[2020-05-28 09:54:50,157] INFO: Restoring schema on the target cluster
[2020-05-28 10:09:03,865] INFO: (Re)creating schema for keyspace whatever
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-180-216.eu-west-1.compute.internal...
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-182-200.eu-west-1.compute.internal...
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-182-219.eu-west-1.compute.internal...
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-181-207.eu-west-1.compute.internal...
[2020-05-28 10:09:26,095] INFO: Restoring data on ip-172-29-180-190.eu-west-1.compute.internal...
[2020-05-28 10:09:26,095] INFO: Restoring data on ip-172-29-181-208.eu-west-1.compute.internal...
[2020-05-28 10:09:26,097] INFO: Executing "nohup sh -c "mkdir /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139; cd /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139 && medusa-wrapper sudo medusa --fqdn=%s -vvv restore-node --in-place  %s --no-verify --backup-name dev-14-05-2020-full --temp-dir /opt/cassandra/data/tmp --use-sstableloader  --table whatever.customer_190416_bob"" on all nodes.
[2020-05-28 10:11:08,607] INFO: Job executing "nohup sh -c "mkdir /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139; cd /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139 && medusa-wrapper sudo medusa --fqdn=%s -vvv restore-node --in-place  %s --no-verify --backup-name dev-14-05-2020-full --temp-dir /opt/cassandra/data/tmp --use-sstableloader  --table whatever.customer_190416_bob"" ran and finished with errors on following nodes: ['ip-172-29-180-190.eu-west-1.compute.internal', 'ip-172-29-180-216.eu-west-1.compute.internal', 'ip-172-29-181-207.eu-west-1.compute.internal', 'ip-172-29-181-208.eu-west-1.compute.internal', 'ip-172-29-182-200.eu-west-1.compute.internal', 'ip-172-29-182-219.eu-west-1.compute.internal']
[2020-05-28 10:11:08,608] INFO: [ip-172-29-180-216.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-180-216.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: [ip-172-29-182-200.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-182-200.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: [ip-172-29-182-219.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-182-219.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: [ip-172-29-181-207.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-181-207.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: [ip-172-29-180-190.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: ip-172-29-180-190.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: [ip-172-29-181-208.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: ip-172-29-181-208.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,610] ERROR: Some nodes failed to restore. Exiting
[2020-05-28 10:11:08,610] ERROR: This error happened during the cluster restore: Some nodes failed to restore. Exiting
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/cassandra_medusa-0.7.0.dev0-py3.7.egg/medusa/restore_cluster.py", line 72, in orchestrate
    restore.execute()
  File "/usr/local/lib/python3.7/site-packages/cassandra_medusa-0.7.0.dev0-py3.7.egg/medusa/restore_cluster.py", line 146, in execute
    self._restore_data()
  File "/usr/local/lib/python3.7/site-packages/cassandra_medusa-0.7.0.dev0-py3.7.egg/medusa/restore_cluster.py", line 350, in _restore_data
    raise Exception(err_msg)
Exception: Some nodes failed to restore. Exiting

On each node, we can find the following log in /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139/stderr:

Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-266-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-289-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-308-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-313-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-318-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-319-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-320-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-321-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-342-big-Data.db  to [/172.29.181.207, /172.29.180.190, /172.29.182.200, /172.29.182.219, /172.29.181.208, ip-172-29-180-160.eu-west-1.compute.internal/172.29.180.160]
ERROR 10:15:51,429 [Stream #345045f0-a0cc-11ea-b7b4-f76b1362a6ed] Streaming error occurred on session with peer 172.29.180.160
java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_252]
	at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_252]
	at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_252]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645) ~[na:1.8.0_252]
	at java.nio.channels.SocketChannel.open(SocketChannel.java:189) ~[na:1.8.0_252]
	at org.apache.cassandra.tools.BulkLoadConnectionFactory.createConnection(BulkLoadConnectionFactory.java:60) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.StreamSession.createConnection(StreamSession.java:282) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.ConnectionHandler.initiate(ConnectionHandler.java:86) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:269) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:263) [apache-cassandra-3.11.2.jar:3.11.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
	at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.2.jar:3.11.2]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_252]

[Then streaming occurs and finishes.]
[When it reaches 100%, it fails with:]

WARN  10:21:34,756 [Stream #345045f0-a0cc-11ea-b7b4-f76b1362a6ed] Stream failed
Streaming to the following hosts failed:
[ip-172-29-180-160.eu-west-1.compute.internal/172.29.180.160]
java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:98)
	at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:48)
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
	at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:88)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
	at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215)
	at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191)
	at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481)
	at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:682)
	at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:532)
	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.cassandra.tools.BulkLoadException: java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
	at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:114)
	at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:48)
Caused by: java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:98)
	... 1 more
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
	at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:88)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
	at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215)
	at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191)
	at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481)
	at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:682)
	at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:532)
	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317)
	at java.lang.Thread.run(Thread.java:748)

This happens because of the sstableloader command that is run locally: it uses the local internal fqdn instead of the fqdn provided in the node's Medusa configuration file.

If we run the command manually and provide the fqdn that is in the config, it runs flawlessly; a sketch of the idea follows.
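
To illustrate the kind of fix I have in mind, here is a minimal Python sketch (the function name and the table path are made up for this example; this is not Medusa's actual code): the loader command should be built with the fqdn from the Medusa config rather than whatever the local host resolves to.

```python
import socket
import subprocess

def build_sstableloader_cmd(table_dir, configured_fqdn=None):
    """Sketch only: build an sstableloader invocation for one table directory.

    Falling back to socket.getfqdn() reproduces the bug described above: on
    our EC2 nodes it resolves to the default AWS interface (e.g.
    ip-172-29-180-160...), which Cassandra's streaming does not listen on,
    hence the "Connection refused" in the stderr log.
    """
    initial_host = configured_fqdn or socket.getfqdn()
    return [
        'sstableloader',
        '-d', initial_host,  # -d/--nodes: initial host(s) sstableloader contacts
        table_dir,
    ]

# Passing the configured fqdn, as we do when running the command manually:
cmd = build_sstableloader_cmd(
    '/opt/cassandra/data/tmp/medusa-restore-xyz/whatever/customer_190416_bob',  # hypothetical path
    configured_fqdn='ip-172-29-180-216.eu-west-1.compute.internal',
)
subprocess.check_call(cmd)
```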

I'm wondering whether that could be fixed in 12e0532.

Otherwise, I opened a PR, linked below.

┆Issue is synchronized with this Jira Task by Unito
┆Issue Number: K8SSAND-278

@adejanovski added the new (Issues requiring triage) label May 28, 2020
@giom-l linked a pull request May 28, 2020 that will close this issue
@rzvoncek added the bug (Something isn't working) label and removed the new (Issues requiring triage) and to-groom labels Apr 3, 2024
@adejanovski added the ready-for-review (Issues in the state 'ready-for-review') label Apr 3, 2024
@rzvoncek added and removed the to-groom label Apr 3, 2024
@adejanovski added the blocked (Issues in the state 'blocked') label and removed the ready-for-review label Apr 19, 2024