add shuffle reuse feature #3908

Draft: wants to merge 2 commits into master
64 changes: 64 additions & 0 deletions src/docs/ocean-spark/tools-integrations/shuffle-plugin.md
@@ -0,0 +1,64 @@
# Shuffle data reuse

Shuffle data reuse is a feature that writes Spark shuffle data to a shared remote filesystem, such as S3.
This allows shuffle data from failed Spark tasks to be reused, avoiding task retries.
The feature is also useful with dynamic allocation enabled,
as it allows scaling down Spark executors that would otherwise be kept running solely because of the shuffle data they hold.
Reusing shuffle data can save both time and resources.

## Configuration

To enable shuffle data reuse, set the following configuration in your Spark application:

```json
{
  "shuffle": {
    "enabled": "true",
    "rootDir": "s3a://<bucket>/path/to/shuffle"
  }
}
```

The `shuffle.rootDir` configuration sets the location where the shuffle data is written.
The shuffle reuse feature uses the Hadoop FileSystem API to write the shuffle data, and as such supports any filesystem that Hadoop supports.
The `rootDir` option can therefore be a local path, an HDFS path, or a path on any other Hadoop-supported filesystem (see the HDFS sketch below).
When using a local path, a shared remote drive, such as FSx or an S3 CSI volume, must be mounted on all the executors in the cluster.
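
As a minimal sketch, a remote `rootDir` on HDFS could look like the following; the namenode host, port, and path are placeholders to replace with your own values:

```json
{
  "shuffle": {
    "enabled": "true",
    "rootDir": "hdfs://<namenode>:<port>/path/to/shuffle"
  }
}
```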

For instance, when using a local path backed by a shared volume:

```json
{
  "shuffle": {
    "rootDir": "/opt/spark/work-dir/shuffle"
  },
  "volumes": [
    {
      "name": "spark-data",
      "persistentVolumeClaim": {
        "claimName": "s3-claim"
      }
    }
  ],
  "driver": {
    "volumeMounts": [
      {
        "mountPath": "/opt/spark/work-dir/shuffle",
        "name": "spark-data"
      }
    ]
  },
  "executor": {
    "volumeMounts": [
      {
        "mountPath": "/opt/spark/work-dir/shuffle",
        "name": "spark-data"
      }
    ]
  }
}
```
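
The example above assumes a PersistentVolumeClaim named `s3-claim` already exists in the application's namespace. A minimal sketch of such a claim in Kubernetes JSON form follows; the storage class and requested size are placeholders that depend on the CSI driver you use:

```json
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "s3-claim"
  },
  "spec": {
    "accessModes": ["ReadWriteMany"],
    "storageClassName": "<s3-csi-storage-class>",
    "resources": {
      "requests": {
        "storage": "<size>"
      }
    }
  }
}
```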

## Limitations

- The shuffle data reuse feature is only available for Spark 3.2 and later.
- Preferably, set `spark.dynamicAllocation.shuffleTracking.enabled` to `false` when using the shuffle data reuse feature, as in the sketch below.
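
A hedged sketch combining both settings; this assumes the application configuration accepts a `sparkConf` map of raw Spark properties (an assumption here; adjust if your deployment passes Spark properties differently):

```json
{
  "shuffle": {
    "enabled": "true",
    "rootDir": "s3a://<bucket>/path/to/shuffle"
  },
  "sparkConf": {
    "spark.dynamicAllocation.shuffleTracking.enabled": "false"
  }
}
```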