
spark_write_csv fails when connecting to Spark as a user that does not start the cluster #3284

Closed
s-geissler opened this issue Aug 12, 2022 · 2 comments


@s-geissler

I am running a standalone Spark cluster (Spark 3.3.0) consisting of 1 master and 4 worker nodes. The cluster is started as the ubuntu user on all 5 nodes. All nodes have access to the same storage via NFS, exported by the master.

In my environment, multiple users need to connect to the cluster and submit Spark applications, and none of them is the ubuntu user that starts the master and workers. Here, I log into RStudio Server as the test user and run the Spark application via sparklyr.

# Load libraries
library(sparklyr)
library(tidyverse)

# Executor resources and dynamic allocation
config <- spark_config()
config$spark.executor.memory <- "5G"
config$spark.executor.cores <- "2"
config$spark.shuffle.service.enabled <- TRUE
config$spark.dynamicAllocation.enabled <- TRUE

Sys.setenv(SPARK_HOME = "/opt/spark/",
           JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64/")

sc <- spark_connect(master = "spark://172.16.44.70:7077",
                    config = config,
                    app_name = "app-test")

sdf_data <- spark_read_parquet(sc, name = "sdf_data",
                               path = "/mnt/storage/example_data",
                               memory = FALSE)

### Everything works until here
### The following fails

spark_write_csv(sdf_data %>% head(100), path = "/mnt/storage/test-data/")

The resulting error is as follows.

Error: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 4 times, most recent failure: Lost task 0.3 in stage 38.0 (TID 2255) (172.16.44.80 executor 309): java.io.IOException: Mkdirs failed to create file:/mnt/storage/test-data/_temporary/0/_temporary/attempt_202208121311203632496408213281273_0038_m_000000_2255 (exists=false, cwd=file:/home/ubuntu/spark-3.3.0-bin-hadoop3/work/app-20220812114226-0003/309)
...

The underlying issue is that the destination folder /mnt/storage/test-data is created by the R user (test) rather than the cluster user (ubuntu). The executors run as ubuntu and therefore cannot create the _temporary subdirectories inside it, so the write fails.

test@compute1:/mnt/storage$ ls -l
total 124
drwxrwxrwx   2 ubuntu   spark      73728 Aug 12 09:17 example_data
drwxr-xr-x   3 test     test       4096  Aug 12 13:10 test-data

I already tried setting default ownership and file permissions on the target folder using Linux ACLs, but it seems R (or sparklyr?) ignores them and the folder ends up with the permissions shown above.
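
One thing I have not ruled out yet is the umask of the driver process. Since spark_connect() launches the driver JVM as a child of my R session, my assumption is that it inherits the session's umask, so relaxing it before connecting might make the directories Spark creates on the NFS mount world-writable. A minimal, untested sketch:

# Untested sketch: relax the umask before connecting, on the assumption
# that the driver JVM (a child process of this R session) inherits it and
# therefore creates /mnt/storage/test-data and its _temporary
# subdirectories world-writable.
Sys.umask("000")
sc <- spark_connect(master = "spark://172.16.44.70:7077",
                    config = config,
                    app_name = "app-test")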

Is there a known workaround for this issue? How can I support multiple users on top of regular file storage without having to spin up a dedicated cluster for every user?
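
For small results there is at least the obvious escape hatch of collecting into the R session and writing the file as the test user, which bypasses the executors entirely but clearly does not scale:

# Workaround for small results only: pull the rows into the R session and
# write the CSV locally as the test user, so no executor touches the path.
# (sample.csv is just an illustrative file name.)
sdf_data %>%
  head(100) %>%
  collect() %>%
  readr::write_csv("/mnt/storage/test-data/sample.csv")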

@edgararuiz (Collaborator)

Hi @s-geissler, does the data created by spark_write_csv() need to be accessible to everyone in the cluster? If so, can you give me more background on this team's workflow?

@github-actions

Automatically closed because there has not been a response for 30 days. When you're ready to work on this further, please comment here and the issue will automatically reopen.
