Skip to content
Bo Yang edited this page Oct 14, 2020 · 2 revisions

Problem To Solve

Apache Spark community is working on storing shuffle data on remote storage (SPARK-25299, Discussion Document). This Remote Shuffle Service demonstrates one implementation by streaming shuffle data to shuffle servers which run on separate machines.

Design Options

There are multiple options to implement a shuffle service storing data on remote storage. Following are some examples (not comprehensive list of all possible solutions):

  1. Streaming Based Remote Shuffle Service: shuffle clients send shuffle records to remote shuffle service in a streaming style, and shuffle service writes the records to different partition files based on the records’ partition id.

  2. Async Shuffle File Upload: shuffle clients write shuffle records to local disk first, then upload/merge shuffle files to remote storage asynchronously.

This project focuses on the first option to illustrate how to design and implement a streaming based remote shuffle service. See this High Level Design Doc for a quick overview.

Clone this wiki locally