Restore with sstableloader does not use fqdn present in configuration #148

Open
giom-l opened this issue May 28, 2020 · 0 comments · May be fixed by #149
Labels
blocked (Issues in the state 'blocked') · bug (Something isn't working)

Comments

giom-l commented May 28, 2020

When restoring a cluster with sstableloader, there seems to be an issue with how the fqdn is handled.

Our cluster is EC2-based, and each node has two network interfaces:

  • one for the internal node (provided by default by AWS)
  • one we attach and use for cluster communication, so that if a node is terminated, its replacement is attached to the same network interface; the fqdn in each node's Medusa configuration refers to this interface (see the sketch after this list)
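
For reference, here is a minimal sketch of reading back the fqdn we are talking about from a node's Medusa configuration. The file path and the `fqdn` option living in the `[storage]` section are assumptions about a typical medusa.ini layout, not copied from this ticket:

```python
import configparser

# Assumption: medusa.ini lives at /etc/medusa/medusa.ini and carries the
# fqdn option in its [storage] section, as in a typical Medusa setup.
parser = configparser.ConfigParser()
parser.read('/etc/medusa/medusa.ini')
configured_fqdn = parser.get('storage', 'fqdn')
# On the seed node in this example that would be
# ip-172-29-180-216.eu-west-1.compute.internal, i.e. the second,
# cluster-facing interface, not the default AWS-provided one.
print(configured_fqdn)
```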

When running a restore using sstableloader via this command:

nohup medusa restore-cluster --backup-name=dev-14-05-2020-full --table=whatever.customer_190416_bob --seed-target=ip-172-29-180-216.eu-west-1.compute.internal

We get the following output:

[2020-05-28 09:54:49,508] INFO: Monitoring provider is noop
[2020-05-28 09:54:49,508] INFO: system_auth keyspace will be overwritten with the backup on target nodes
[2020-05-28 09:54:50,090] INFO: Ensuring the backup is found and is complete
[2020-05-28 09:54:50,124] INFO: Restore will happen "In-Place", no new hardware is involved
[2020-05-28 09:54:50,154] INFO: Tokenmap is differently distributed. Extra items: {'[truncated list of tokens]'}
[2020-05-28 09:54:50,156] INFO: Starting cluster restore...
[2020-05-28 09:54:50,156] INFO: Working directory for this execution: /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-180-216.eu-west-1.compute.internal using {'source': ['ip-172-29-181-241.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-182-200.eu-west-1.compute.internal using {'source': ['ip-172-29-181-163.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-182-219.eu-west-1.compute.internal using {'source': ['ip-172-29-180-194.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-181-207.eu-west-1.compute.internal using {'source': ['ip-172-29-180-186.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,156] INFO: About to restore on ip-172-29-180-190.eu-west-1.compute.internal using {'source': ['ip-172-29-182-156.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,157] INFO: About to restore on ip-172-29-181-208.eu-west-1.compute.internal using {'source': ['ip-172-29-182-138.eu-west-1.compute.internal'], 'seed': False} as backup source
[2020-05-28 09:54:50,157] INFO: This will delete all data on the target nodes and replace it with backup dev-14-05-2020-full.
[2020-05-28 09:54:50,157] INFO: target seeds : []
[2020-05-28 09:54:50,157] INFO: Restoring schema on the target cluster
[2020-05-28 10:09:03,865] INFO: (Re)creating schema for keyspace whatever
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-180-216.eu-west-1.compute.internal...
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-182-200.eu-west-1.compute.internal...
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-182-219.eu-west-1.compute.internal...
[2020-05-28 10:09:26,094] INFO: Restoring data on ip-172-29-181-207.eu-west-1.compute.internal...
[2020-05-28 10:09:26,095] INFO: Restoring data on ip-172-29-180-190.eu-west-1.compute.internal...
[2020-05-28 10:09:26,095] INFO: Restoring data on ip-172-29-181-208.eu-west-1.compute.internal...
[2020-05-28 10:09:26,097] INFO: Executing "nohup sh -c "mkdir /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139; cd /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139 && medusa-wrapper sudo medusa --fqdn=%s -vvv restore-node --in-place  %s --no-verify --backup-name dev-14-05-2020-full --temp-dir /opt/cassandra/data/tmp --use-sstableloader  --table whatever.customer_190416_bob"" on all nodes.
[2020-05-28 10:11:08,607] INFO: Job executing "nohup sh -c "mkdir /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139; cd /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139 && medusa-wrapper sudo medusa --fqdn=%s -vvv restore-node --in-place  %s --no-verify --backup-name dev-14-05-2020-full --temp-dir /opt/cassandra/data/tmp --use-sstableloader  --table whatever.customer_190416_bob"" ran and finished with errors on following nodes: ['ip-172-29-180-190.eu-west-1.compute.internal', 'ip-172-29-180-216.eu-west-1.compute.internal', 'ip-172-29-181-207.eu-west-1.compute.internal', 'ip-172-29-181-208.eu-west-1.compute.internal', 'ip-172-29-182-200.eu-west-1.compute.internal', 'ip-172-29-182-219.eu-west-1.compute.internal']
[2020-05-28 10:11:08,608] INFO: [ip-172-29-180-216.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-180-216.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: [ip-172-29-182-200.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-182-200.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: [ip-172-29-182-219.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-182-219.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: [ip-172-29-181-207.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,608] INFO: ip-172-29-181-207.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: [ip-172-29-180-190.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: ip-172-29-180-190.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: [ip-172-29-181-208.eu-west-1.compute.internal]	nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,609] INFO: ip-172-29-181-208.eu-west-1.compute.internal-stdout: nohup: ignoring input and appending output to ‘nohup.out’
[2020-05-28 10:11:08,610] ERROR: Some nodes failed to restore. Exiting
[2020-05-28 10:11:08,610] ERROR: This error happened during the cluster restore: Some nodes failed to restore. Exiting
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/cassandra_medusa-0.7.0.dev0-py3.7.egg/medusa/restore_cluster.py", line 72, in orchestrate
    restore.execute()
  File "/usr/local/lib/python3.7/site-packages/cassandra_medusa-0.7.0.dev0-py3.7.egg/medusa/restore_cluster.py", line 146, in execute
    self._restore_data()
  File "/usr/local/lib/python3.7/site-packages/cassandra_medusa-0.7.0.dev0-py3.7.egg/medusa/restore_cluster.py", line 350, in _restore_data
    raise Exception(err_msg)
Exception: Some nodes failed to restore. Exiting

On each node, we can find the following log in /opt/cassandra/data/tmp/medusa-job-4c6dbf69-a756-4b98-b477-962a36980139/stderr:

Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-266-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-289-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-308-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-313-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-318-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-319-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-320-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-321-big-Data.db /opt/apache-cassandra-3.11.2/data/tmp/medusa-restore-15a118af-88e7-4bbe-a73a-adee71a43150/whatever/customer_190416_bob-4bf290b057b011ea8398d99fe6b0f187/mc-342-big-Data.db  to [/172.29.181.207, /172.29.180.190, /172.29.182.200, /172.29.182.219, /172.29.181.208, ip-172-29-180-160.eu-west-1.compute.internal/172.29.180.160]
ERROR 10:15:51,429 [Stream #345045f0-a0cc-11ea-b7b4-f76b1362a6ed] Streaming error occurred on session with peer 172.29.180.160
java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_252]
	at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_252]
	at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_252]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645) ~[na:1.8.0_252]
	at java.nio.channels.SocketChannel.open(SocketChannel.java:189) ~[na:1.8.0_252]
	at org.apache.cassandra.tools.BulkLoadConnectionFactory.createConnection(BulkLoadConnectionFactory.java:60) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.StreamSession.createConnection(StreamSession.java:282) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.ConnectionHandler.initiate(ConnectionHandler.java:86) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:269) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:263) [apache-cassandra-3.11.2.jar:3.11.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
	at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.2.jar:3.11.2]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_252]

[Then streaming occurs and finishes.]
[When it reaches 100%, it fails with:]

WARN  10:21:34,756 [Stream #345045f0-a0cc-11ea-b7b4-f76b1362a6ed] Stream failed
Streaming to the following hosts failed:
[ip-172-29-180-160.eu-west-1.compute.internal/172.29.180.160]
java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:98)
	at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:48)
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
	at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:88)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
	at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215)
	at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191)
	at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481)
	at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:682)
	at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:532)
	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.cassandra.tools.BulkLoadException: java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
	at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:114)
	at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:48)
Caused by: java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:98)
	... 1 more
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
	at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:88)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
	at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215)
	at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191)
	at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481)
	at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:682)
	at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:532)
	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317)
	at java.lang.Thread.run(Thread.java:748)

This happens because of the sstableloader command that is run locally: it uses the local internal fqdn instead of the fqdn provided in the node's Medusa configuration file.

If we run the command manually and provide the fqdn that is in the config, it runs flawlessly; a sketch of the idea follows.
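
To illustrate the kind of fix I have in mind, here is a minimal Python sketch (the function name and the table path are made up for this example; this is not Medusa's actual code): the loader command should be built with the fqdn from the Medusa config rather than whatever the local host resolves to.

```python
import socket
import subprocess

def build_sstableloader_cmd(table_dir, configured_fqdn=None):
    """Sketch only: build an sstableloader invocation for one table directory.

    Falling back to socket.getfqdn() reproduces the bug described above: on
    our EC2 nodes it resolves to the default AWS interface (e.g.
    ip-172-29-180-160...), which Cassandra's streaming does not listen on,
    hence the "Connection refused" in the stderr log.
    """
    initial_host = configured_fqdn or socket.getfqdn()
    return [
        'sstableloader',
        '-d', initial_host,  # -d/--nodes: initial host(s) sstableloader contacts
        table_dir,
    ]

# Passing the configured fqdn, as we do when running the command manually:
cmd = build_sstableloader_cmd(
    '/opt/cassandra/data/tmp/medusa-restore-xyz/whatever/customer_190416_bob',  # hypothetical path
    configured_fqdn='ip-172-29-180-216.eu-west-1.compute.internal',
)
subprocess.check_call(cmd)
```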

I'm wondering whether that could be fixed in 12e0532.

Otherwise, I opened a PR, linked below.

┆Issue is synchronized with this Jira Task by Unito
┆Issue Number: K8SSAND-278

@adejanovski added the new (Issues requiring triage) label May 28, 2020
@giom-l linked a pull request May 28, 2020 that will close this issue
@rzvoncek added the bug (Something isn't working) label and removed the new (Issues requiring triage) and to-groom labels Apr 3, 2024
@adejanovski added the ready-for-review (Issues in the state 'ready-for-review') label Apr 3, 2024
@rzvoncek added and removed the to-groom label Apr 3, 2024
@adejanovski added the blocked (Issues in the state 'blocked') label and removed the ready-for-review label Apr 19, 2024