
hit exception writing heading bytes XXXXX #76

Closed
Lobo2008 opened this issue Aug 8, 2022 · 8 comments

Comments

Lobo2008 commented Aug 8, 2022

I am running a 1 TB to 3 TB Spark application, and it always fails after running for several hours.
Below is the exception:

Stage 0:>                                                       (0 + 0) / 1000]22/08/06 13:07:28 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: 
Aborting TaskSet 0.0 because task 886 (partition 886)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 107.1 in stage 0.0 (TID 1219, 10.203.23.201, executor 463): com.uber.rss.exceptions.RssNetworkException: writeRowGroup: hit exception writing heading bytes 13586, DataBlockSyncWriteClient 82 [/XXXXXX.201:47560 -> MY_RSS_HOST/10.XXXXX.230:12202 (XXXXXXX)], SocketException (Broken pipe)
	at com.uber.rss.clients.DataBlockSyncWriteClient.writeData(DataBlockSyncWriteClient.java:133)
	at com.uber.rss.clients.PlainShuffleDataSyncWriteClient.writeDataBlock(PlainShuffleDataSyncWriteClient.java:40)
	at com.uber.rss.clients.ServerIdAwareSyncWriteClient.writeDataBlock(ServerIdAwareSyncWriteClient.java:73)
	at com.uber.rss.clients.ReplicatedWriteClient.lambda$writeDataBlock$2(ReplicatedWriteClient.java:82)
	at com.uber.rss.clients.ReplicatedWriteClient.runAllActiveClients(ReplicatedWriteClient.java:154)
	at com.uber.rss.clients.ReplicatedWriteClient.writeDataBlock(ReplicatedWriteClient.java:78)
	at com.uber.rss.clients.MultiServerSyncWriteClient.writeDataBlock(MultiServerSyncWriteClient.java:124)
	at com.uber.rss.clients.LazyWriteClient.writeDataBlock(LazyWriteClient.java:99)
	at org.apache.spark.shuffle.RssShuffleWriter$$anonfun$sendDataBlocks$1.apply(RssShuffleWriter.scala:166)
	at org.apache.spark.shuffle.RssShuffleWriter$$anonfun$sendDataBlocks$1.apply(RssShuffleWriter.scala:161)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.shuffle.RssShuffleWriter.sendDataBlocks(RssShuffleWriter.scala:161)
	at org.apache.spark.shuffle.RssShuffleWriter.write(RssShuffleWriter.scala:108)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:415)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1403)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:421)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Broken pipe
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:141)
	at com.uber.rss.clients.DataBlockSyncWriteClient.writeData(DataBlockSyncWriteClient.java:131)
hiboyang (Contributor) commented Aug 8, 2022

There is a max bytes limit in the shuffle server to protect the server; see https://github.com/uber/RemoteShuffleService/blob/master/src/main/java/com/uber/rss/execution/ShuffleExecutor.java#L81

You could change that value if your shuffle data exceeds that limit.
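
For reference, a minimal sketch of how such a per-application write limit is typically enforced on the server side. This is not the actual ShuffleExecutor code; the class and method names are assumptions, and only the DEFAULT_APP_MAX_WRITE_BYTES constant, its ~3 TB default, and the RssTooMuchDataException mentioned later in this thread come from the discussion.

// Illustrative sketch only -- not the actual ShuffleExecutor code.
// Assumed names: AppWriteLimitGuard, addAndCheck.
public class AppWriteLimitGuard {
    // Roughly 3 TB, matching the default limit discussed in this thread.
    public static final long DEFAULT_APP_MAX_WRITE_BYTES = 3L * 1024 * 1024 * 1024 * 1024;

    private final long maxWriteBytes;
    private long writtenBytes = 0;

    public AppWriteLimitGuard(long maxWriteBytes) {
        this.maxWriteBytes = maxWriteBytes;
    }

    // Called each time the server accepts another data block for this application.
    public synchronized void addAndCheck(long blockBytes) {
        writtenBytes += blockBytes;
        if (writtenBytes > maxWriteBytes) {
            // The real server would surface this as com.uber.rss.exceptions.RssTooMuchDataException;
            // a plain RuntimeException stands in for it in this sketch.
            throw new RuntimeException("Application wrote " + writtenBytes
                + " bytes to this server, exceeding the limit of " + maxWriteBytes + " bytes");
        }
    }
}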

Lobo2008 (Author) commented Aug 8, 2022

Thanks, I'll try it.

mayurdb (Collaborator) commented Aug 8, 2022

Hi @Lobo2008, as Bo mentioned, let us know whether the max app shuffle data size per server is the issue or not. If it is, you should see an RssTooMuchDataException in the stack trace.

If that's not the issue, please check:

  • whether you are using the latest master
  • the task time of the failing task and the amount of shuffle data written

Lobo2008 (Author) commented Aug 8, 2022

Hi @mayurdb

  • It's the latest version; I cloned and compiled the master branch in April 2022.
  • No RssTooMuchDataException ever occurred, only the RssNetworkException.
  • I have re-run the app without changing the size Bo mentioned (I'll try that later), and so far it runs well. I'll post the details once the application finishes or fails.
  • I wonder whether DEFAULT_APP_MAX_WRITE_BYTES = 3 TB limits the shuffle size of a single stage, or the accumulated size of all shuffle-write stages of one application. Stage 6 wrote 3 TB but still works fine.

[attached screenshot of the Spark UI stage list]


cpd85 commented Aug 9, 2022

I think DEFAULT_APP_MAX_WRITE_BYTES is actually per server, so if you write 3 TB of data but distribute it evenly across multiple servers, you would not run into the issue.
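
To make that concrete, a tiny back-of-the-envelope sketch of the per-server reading (the class name and numbers other than the ~3 TB default are made up for illustration):

// Illustrative arithmetic only -- class and numbers (except the ~3 TB default) are assumptions.
public class PerServerLimitExample {
    static final long APP_MAX_WRITE_BYTES_PER_SERVER = 3L * 1024 * 1024 * 1024 * 1024; // ~3 TB

    public static void main(String[] args) {
        long totalShuffleBytes = 3L * 1024 * 1024 * 1024 * 1024; // a 3 TB shuffle, as in this issue
        int numServers = 8;                                      // shuffle servers used by the app

        // With an even distribution, each server only accounts for a fraction of the total,
        // so no single server crosses the per-server limit.
        long bytesPerServer = totalShuffleBytes / numServers;
        System.out.println("per-server bytes = " + bytesPerServer
            + ", limit = " + APP_MAX_WRITE_BYTES_PER_SERVER
            + ", exceeded = " + (bytesPerServer > APP_MAX_WRITE_BYTES_PER_SERVER));
    }
}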

Lobo2008 (Author) commented:

> I think DEFAULT_APP_MAX_WRITE_BYTES is actually per server, so if you write 3 TB of data but distribute it evenly across multiple servers, you would not run into the issue.

I guess so.

Lobo2008 (Author) commented:

The application finished successfully. But I found that the "hit exception writing heading bytes" exception was caused by one or more of the RSS servers running out of disk space.
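
For anyone hitting the same symptom, a minimal sketch of a standalone disk-space check one could run on a shuffle server host (the data directory path and the 50 GB threshold are assumptions for illustration):

// Illustrative check only -- not part of the RSS codebase.
import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        File shuffleDataDir = new File(args.length > 0 ? args[0] : "/data/rss");
        long minFreeBytes = 50L * 1024 * 1024 * 1024; // require at least 50 GB free

        long usable = shuffleDataDir.getUsableSpace(); // returns 0 if the path does not exist
        System.out.println("usable bytes on " + shuffleDataDir + " = " + usable);
        if (usable < minFreeBytes) {
            System.out.println("WARNING: low disk space -- shuffle clients writing to this server "
                + "may start failing with broken-pipe errors like the one in this issue.");
        }
    }
}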

hiboyang (Contributor) commented:

Cool, glad you found the cause, and thanks for the update!

Lobo2008 closed this as completed Oct 9, 2022