
spark_write_csv fails when connecting to Spark as a user that does not start the cluster #3284

Closed
s-geissler opened this issue Aug 12, 2022 · 2 comments


@s-geissler

I am running a standalone Spark cluster (Spark 3.3.0) consisting of 1 master and 4 worker nodes. The cluster is started as the ubuntu user on all 5 nodes. All nodes have access to the same storage via NFS, exported by the master.

In my environment, multiple users need to connect to the cluster and submit Spark applications, and none of them is the ubuntu user that starts the master and workers. Here, I log into RStudio Server as the test user and run the Spark application via sparklyr.

# Load libraries
library(sparklyr)
library(tidyverse)

# Executor resources and dynamic allocation
config <- spark_config()
config$spark.executor.memory <- "5G"
config$spark.executor.cores <- "2"
config$spark.shuffle.service.enabled <- TRUE
config$spark.dynamicAllocation.enabled <- TRUE

Sys.setenv(SPARK_HOME = "/opt/spark/",
           JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64/")

sc <- spark_connect(master = "spark://172.16.44.70:7077",
                    config = config,
                    app_name = "app-test")

sdf_data <- spark_read_parquet(sc, name = "sdf_data",
                               path = "/mnt/storage/example_data",
                               memory = FALSE)

### Everything works until here
### The following fails

spark_write_csv(sdf_data %>% head(100), path = "/mnt/storage/test-data/")

The resulting error is as follows.

Error: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 4 times, most recent failure: Lost task 0.3 in stage 38.0 (TID 2255) (172.16.44.80 executor 309): java.io.IOException: Mkdirs failed to create file:/mnt/storage/test-data/_temporary/0/_temporary/attempt_202208121311203632496408213281273_0038_m_000000_2255 (exists=false, cwd=file:/home/ubuntu/spark-3.3.0-bin-hadoop3/work/app-20220812114226-0003/309)
...

The underlying issue is that the destination folder /mnt/storage/test-data is created by the R user (test) rather than the cluster user (ubuntu). The executors run as ubuntu and therefore cannot create the _temporary subdirectories inside it, so the write fails.

test@compute1:/mnt/storage$ ls -l
total 124
drwxrwxrwx   2 ubuntu   spark      73728 Aug 12 09:17 example_data
drwxr-xr-x   3 test     test       4096  Aug 12 13:10 test-data

I already tried setting default ownership and file permissions on the target folder using Linux ACLs, but it seems R (or sparklyr?) ignores them and the folder ends up with the permissions shown above.
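
One thing I have not ruled out yet is the umask of the driver process. Since spark_connect() launches the driver JVM as a child of my R session, my assumption is that it inherits the session's umask, so relaxing it before connecting might make the directories Spark creates on the NFS mount world-writable. A minimal, untested sketch:

# Untested sketch: relax the umask before connecting, on the assumption
# that the driver JVM (a child process of this R session) inherits it and
# therefore creates /mnt/storage/test-data and its _temporary
# subdirectories world-writable.
Sys.umask("000")
sc <- spark_connect(master = "spark://172.16.44.70:7077",
                    config = config,
                    app_name = "app-test")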

Is there a known workaround for this issue? How can I support multiple users on top of regular file storage without having to spin up a dedicated cluster for every user?
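
For small results there is at least the obvious escape hatch of collecting into the R session and writing the file as the test user, which bypasses the executors entirely but clearly does not scale:

# Workaround for small results only: pull the rows into the R session and
# write the CSV locally as the test user, so no executor touches the path.
# (sample.csv is just an illustrative file name.)
sdf_data %>%
  head(100) %>%
  collect() %>%
  readr::write_csv("/mnt/storage/test-data/sample.csv")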

@edgararuiz (Collaborator)

Hi @s-geissler, does the data created by spark_write_csv() need to be accessible to everyone in the cluster? If so, can you give me more background on this team's workflow?

@github-actions

Automatically closed because there has not been a response for 30 days. When you're ready to work on this further, please comment here and the issue will automatically reopen.
