what may cause RssInvalidServerVersionException? #94

Open
Lobo2008 opened this issue Feb 8, 2023 · 2 comments

Comments

Lobo2008 commented Feb 8, 2023

Hi, I am wondering:

Q1. Can RssInvalidServerVersionException occur when an RSS server instance (RSS-i) crashes for some reason and is immediately restarted by a shell script while some applications are still using it? In that case the clients still store the former RSS-i version, but the newly registered RSS-i already has a different version.

Could the following exception have the same cause?

org.apache.spark.shuffle.FetchFailedException: Detected server restart, current server: Server{rss04.xxx:12203, 1675897753258, rss04xxx:/data/}, previous server: Server{rss04.xxxx:12203, 1675895945858, rss04xxx:/data/}
	at org.apache.spark.shuffle.RssShuffleManager$$anon$2.resolveConnection(RssShuffleManager.scala:220)
	at com.uber.rss.clients.ServerConnectionCacheUpdateRefresher.refreshConnection(ServerConnectionCacheUpdateRefresher.java:49)
	at com.uber.rss.clients.ServerIdAwareSyncWriteClient.connectImpl(ServerIdAwareSyncWriteClient.java:133)
	at ...

Q2. What may cause this exception?

org.apache.spark.shuffle.FetchFailedException: Cannot fetch shuffle 0 partition 362 due to RssAggregateException (RssShuffleStageNotStartedException (Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxx44973 -> /10.20xxx:12212 (1xxxx28)])
com.uber.rss.exceptions.RssShuffleStageNotStartedException: Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxxx:44973 -> /10.2xxx12212 (10.xxxx)]
	at com.uber.rss.clients.ClientBase.checkOKResponseStatus(ClientBase.java:291)
	at com.uber.rss.clients.ClientBase.readResponseStatus(ClientBase.java:275)
	at ...
Collaborator

mayurdb commented Mar 20, 2023

Q1
You are right. This happens because the server restarted after the client had initially connected to the earlier server instance. Ideally this should not be an issue. Maybe we can remove this check, @hiboyang?
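
For readers following along, the restart check described here can be sketched roughly as below. This is a minimal illustration, not the actual RSS code: `ServerInfo` and `RestartCheck` are hypothetical names, the host is a placeholder, and it assumes the second field printed in `Server{...}` is a per-start version (it looks like a start timestamp in epoch milliseconds) that changes whenever the process is restarted.

```java
// Sketch only: ServerInfo and RestartCheck are hypothetical names, not RSS classes.
final class RestartCheck {

    /** Same address but a different version suggests the process behind it was replaced. */
    static boolean isRestarted(ServerInfo previous, ServerInfo current) {
        return previous.hostAndPort.equals(current.hostAndPort)
                && previous.version != current.version;
    }

    public static void main(String[] args) {
        // Placeholder host; the version values are the ones shown in the trace above.
        ServerInfo previous = new ServerInfo("rss04.example:12203", 1675895945858L);
        ServerInfo current = new ServerInfo("rss04.example:12203", 1675897753258L);

        if (isRestarted(previous, current)) {
            System.out.println("Detected server restart, current: " + current
                    + ", previous: " + previous);
        }
    }
}

final class ServerInfo {
    final String hostAndPort;
    final long version; // assumption: a per-start value such as the start time in epoch millis

    ServerInfo(String hostAndPort, long version) {
        this.hostAndPort = hostAndPort;
        this.version = version;
    }

    @Override
    public String toString() {
        return "Server{" + hostAndPort + ", " + version + "}";
    }
}
```

Under this reading, a client that cached the earlier version sees a mismatch on reconnect and fails, which matches the situation described in Q1.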

Q2
That means the server you are trying to connect to has not yet received the shuffle data for the corresponding partition (identified using appId, appAttemptId, and shuffleId). Does this also happen when the server is restarted?
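
A rough sketch of that lookup, to make the failure mode concrete. Everything here is hypothetical (`ShuffleKey`, `ShuffleRegistry`, and the status strings are illustrative names, not RSS classes or protocol values), and it assumes the server tracks which shuffles it has received data for in memory. Whether a restarted server actually loses that state is not confirmed in this thread.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Sketch only: hypothetical names, not the RSS implementation.
final class ShuffleRegistry {
    // Shuffles for which this server process has received at least one write.
    private final Set<ShuffleKey> started = new HashSet<>();

    void onWrite(ShuffleKey key) {
        started.add(key);
    }

    /** Status a read request would get back; the strings are illustrative. */
    String checkRead(ShuffleKey key) {
        return started.contains(key) ? "OK" : "SHUFFLE_NOT_STARTED";
    }

    public static void main(String[] args) {
        ShuffleKey key = new ShuffleKey("app-1", "1", 0);

        ShuffleRegistry server = new ShuffleRegistry();
        System.out.println(server.checkRead(key)); // read arrives before any write
        server.onWrite(key);
        System.out.println(server.checkRead(key)); // read arrives after a write

        // Assumption: a restarted process starts with empty in-memory state.
        ShuffleRegistry restarted = new ShuffleRegistry();
        System.out.println(restarted.checkRead(key));
    }
}

final class ShuffleKey {
    final String appId;
    final String appAttemptId;
    final int shuffleId;

    ShuffleKey(String appId, String appAttemptId, int shuffleId) {
        this.appId = appId;
        this.appAttemptId = appAttemptId;
        this.shuffleId = shuffleId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ShuffleKey)) return false;
        ShuffleKey other = (ShuffleKey) o;
        return shuffleId == other.shuffleId
                && appId.equals(other.appId)
                && appAttemptId.equals(other.appAttemptId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(appId, appAttemptId, shuffleId);
    }
}
```

If the assumption holds, a reader that reaches a freshly restarted server would see the "not started" status even though the data had been written to the earlier instance.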

Contributor

hiboyang commented May 3, 2023

Previously, RSS did not handle server restarts well, so we added those checks. I feel we could remove them.
