
Set environment variables for spark_apply() #915

Closed · chezou opened this issue Aug 10, 2017 · 7 comments

@chezou (Contributor) commented Aug 10, 2017

Rscript requires environment variables such as RHOME to be set if Rscript is moved after installation, and SparkR can set them as --conf options (see also). If sparklyr supported this, we could use a parcel-based installation of R.

I'm creating an R parcel for CDH so that spark_apply() is easy to run, because CDH doesn't ship R on worker nodes by default.

I also tried running the following code, but it returned an Rscript execution error: No such file or directory. I guess ProcessBuilder does not handle environment variables passed this way.

library(sparklyr)

# Attempt to prefix environment variables onto the Rscript command;
# this fails because ProcessBuilder treats the whole string as one path.
config <- spark_config()
config[["spark.r.command"]] <- paste(
  "R_HOME=/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R",
  "RHOME=/opt/cloudera/parcels/CONDAR/lib/conda-R",
  "R_SHARE_DIR=/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/share",
  "R_INCLUDE_DIR=/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/include",
  "/opt/cloudera/parcels/CONDAR/lib/conda-R/bin/Rscript",
  sep = " ")

sc <- spark_connect(master = "yarn-client", config = config)

sdf_len(sc, 5, repartition = 1) %>%
  spark_apply(function(e) I(e))
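
For reference, the --conf approach mentioned above corresponds to Spark's spark.executorEnv.* properties (plus spark.yarn.appMasterEnv.* on YARN). Below is a sketch of the equivalent sparklyr config; whether the Rscript process spawned by the executor actually inherits these variables is exactly what's uncertain here, which is why a dedicated setting is being requested.

# Sketch: set executor-level environment variables via Spark properties.
# These reach the executor JVM; inheritance by the spawned Rscript
# process is not guaranteed.
config <- spark_config()
config[["spark.executorEnv.R_HOME"]] <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"
config[["spark.yarn.appMasterEnv.R_HOME"]] <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"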
@javierluraschi (Collaborator) commented:

@chezou good point. I'll add a sparklyr.apply.env configuration setting to pass a list, say list(R_HOME = "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"), when we run Rscript. I'll get this done today.

Thanks for putting that parcel together! Would you mind sharing details on how we can test that parcel on our end?
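
Based on the proposal above (and matching the usage that appears later in this thread), the setting would look roughly like this:

# Proposed sparklyr.apply.env.* settings: each entry becomes an
# environment variable for the worker-side Rscript process.
config <- spark_config()
config$sparklyr.apply.env.R_HOME <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"
sc <- spark_connect(master = "yarn-client", config = config)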

@chezou (Contributor, Author) commented Aug 10, 2017

@javierluraschi I added step-by-step instructions for installing the parcel: https://github.com/chezou/cloudera-parcel

I also created a parcel for RHEL7. You can add http://se-jp-parcel.s3-website-ap-northeast-1.amazonaws.com/ as a parcel repository for testing. Note that it is a temporary parcel repo.

For details on adding parcels to CDH, this blog post will help:
https://blogs.msdn.microsoft.com/pliu/2016/06/19/run-jupyter-notebook-on-cloudera/

@chezou (Contributor, Author) commented Aug 11, 2017

Added a build script for RHEL7: https://github.com/chezou/cloudera-parcel/blob/master/build.sh

@javierluraschi (Collaborator) commented:

@chezou that's great, thanks for the steps! When you mention this is a temporary repo, should we consider publishing your parcel in a more official repo? In the meantime, we will try this out on our end as well and report back!

@chezou (Contributor, Author) commented Aug 12, 2017

Finally, it works with upstream sparklyr! Thanks for your quick response, @javierluraschi.


It would be nice to publish my parcel in your repo, but it is provided as-is and I can't guarantee support; please think of it as a community parcel.

chezou closed this as completed Aug 12, 2017
@chezou (Contributor, Author) commented Aug 19, 2017

Just FYI: now that the environment-variable option is available, the following approach, which uses a conda environment, works fine. It distributes the R environment with each Spark job.

# Create a relocatable conda environment containing R (r-essentials) and ICU.
$ conda create -p ~/r_env --copy -y -q r-essentials icu -c r
# Rewrite the hard-coded home path in the R launcher so it resolves
# inside the archive distributed to the workers.
$ sed -i "s,/home/<USERNAME>,./r_env.zip,g" r_env/bin/R
# Package the environment for spark.yarn.dist.archives.
$ zip -r r_env.zip r_env
Then, in R:

library(sparklyr)

config <- spark_config()
# Point Spark at the Rscript inside the distributed archive.
config[["spark.r.command"]] <- "./r_env.zip/r_env/bin/Rscript"
# Ship the zipped conda environment to the YARN containers.
config[["spark.yarn.dist.archives"]] <- "r_env.zip"
# Set the R environment variables for the worker-side Rscript.
config$sparklyr.apply.env.R_HOME <- "./r_env.zip/r_env/lib/R"
config$sparklyr.apply.env.RHOME <- "./r_env.zip/r_env"
config$sparklyr.apply.env.R_SHARE_DIR <- "./r_env.zip/r_env/lib/R/share"
config$sparklyr.apply.env.R_INCLUDE_DIR <- "./r_env.zip/r_env/lib/R/include"

sc <- spark_connect(master = "yarn-client", config = config)

sdf_len(sc, 5, repartition = 1) %>%
  spark_apply(function(e) I(e))
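
A quick sanity check, as a sketch: return the environment as seen by the workers to confirm the sparklyr.apply.env.* values took effect.

# Sketch: each worker reports the R_HOME it actually sees.
sdf_len(sc, 1, repartition = 1) %>%
  spark_apply(function(e) data.frame(r_home = Sys.getenv("R_HOME")))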

@chezou (Contributor, Author) commented Aug 19, 2017

Running the broom package works fine with the packages = FALSE option; otherwise it hits a libicui18n.so dependency conflict. See chezou/cloudera-parcel#2.

iris_tbl <- sdf_copy_to(sc, iris)

# Fit a per-Species linear model on the workers and tidy the result;
# packages = FALSE avoids copying local packages to the workers.
spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Length ~ Petal_Width, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species",
  packages = FALSE)
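
Note that with packages = FALSE, sparklyr does not copy the driver's local packages to the workers, so broom has to be present in the distributed conda environment already. A sketch of a check, assuming nothing beyond base R on the workers:

# Sketch: verify broom is loadable inside the distributed environment.
sdf_len(sc, 1, repartition = 1) %>%
  spark_apply(function(e) data.frame(has_broom = requireNamespace("broom", quietly = TRUE)))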
