
Set environment variables for spark_apply() #915

Closed · chezou opened this issue Aug 10, 2017 · 7 comments

@chezou (Contributor) commented Aug 10, 2017

Rscript requires environment variables such as RHOME to be set if Rscript is moved after installation, and SparkR can set them as --conf options (see also). If sparklyr supported this, we could use a parcel-based installation of R.

I'm creating an R parcel for CDH so that spark_apply() is easy to run, because CDH doesn't ship R on worker nodes by default.

I also tried running the following code, but it returned an Rscript execution error: No such file or directory. I guess ProcessBuilder does not handle environment variables passed this way.

library(sparklyr)

# Attempt to prefix environment variables onto the Rscript command;
# this fails because ProcessBuilder treats the whole string as one path.
config <- spark_config()
config[["spark.r.command"]] <- paste(
  "R_HOME=/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R",
  "RHOME=/opt/cloudera/parcels/CONDAR/lib/conda-R",
  "R_SHARE_DIR=/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/share",
  "R_INCLUDE_DIR=/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/include",
  "/opt/cloudera/parcels/CONDAR/lib/conda-R/bin/Rscript",
  sep = " ")

sc <- spark_connect(master = "yarn-client", config = config)

sdf_len(sc, 5, repartition = 1) %>%
  spark_apply(function(e) I(e))
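
For reference, the --conf approach mentioned above corresponds to Spark's spark.executorEnv.* properties (plus spark.yarn.appMasterEnv.* on YARN). Below is a sketch of the equivalent sparklyr config; whether the Rscript process spawned by the executor actually inherits these variables is exactly what's uncertain here, which is why a dedicated setting is being requested.

# Sketch: set executor-level environment variables via Spark properties.
# These reach the executor JVM; inheritance by the spawned Rscript
# process is not guaranteed.
config <- spark_config()
config[["spark.executorEnv.R_HOME"]] <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"
config[["spark.yarn.appMasterEnv.R_HOME"]] <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"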
@javierluraschi (Collaborator) commented:

@chezou good point. I'll add a sparklyr.apply.env configuration setting to pass a list, say list(R_HOME = "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"), when we run Rscript. I'll get this done today.

Thanks for putting that parcel together! Would you mind sharing details on how we can test that parcel on our end?
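
Based on the proposal above (and matching the usage that appears later in this thread), the setting would look roughly like this:

# Proposed sparklyr.apply.env.* settings: each entry becomes an
# environment variable for the worker-side Rscript process.
config <- spark_config()
config$sparklyr.apply.env.R_HOME <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"
sc <- spark_connect(master = "yarn-client", config = config)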

@chezou (Contributor, Author) commented Aug 10, 2017

@javierluraschi I added step-by-step instructions for installing the parcel: https://github.com/chezou/cloudera-parcel

I also created a parcel for RHEL7. You can add http://se-jp-parcel.s3-website-ap-northeast-1.amazonaws.com/ as a parcel repository for testing. Note that it is a temporary parcel repo.

For details on adding parcels to CDH, this blog post will help:
https://blogs.msdn.microsoft.com/pliu/2016/06/19/run-jupyter-notebook-on-cloudera/

@chezou (Contributor, Author) commented Aug 11, 2017

Added a build script for RHEL7: https://github.com/chezou/cloudera-parcel/blob/master/build.sh

@javierluraschi (Collaborator) commented:

@chezou that's great, thanks for the steps! When you mention this is a temporary repo, should we consider publishing your parcel in a more official repo? In the meantime, we will try this out on our end as well and report back!

@chezou (Contributor, Author) commented Aug 12, 2017

Finally, it works with upstream sparklyr! Thanks for your quick response, @javierluraschi.


It would be nice to publish my parcel in your repo, but it is provided as-is and I can't guarantee support; please think of it as a community parcel.

chezou closed this as completed Aug 12, 2017
@chezou (Contributor, Author) commented Aug 19, 2017

Just FYI: now that the environment-variable option is available, the following approach, which uses a conda environment, works fine. It distributes the R environment with each Spark job.

# Create a relocatable conda environment containing R (r-essentials) and ICU.
$ conda create -p ~/r_env --copy -y -q r-essentials icu -c r
# Rewrite the hard-coded home path in the R launcher so it resolves
# inside the archive distributed to the workers.
$ sed -i "s,/home/<USERNAME>,./r_env.zip,g" r_env/bin/R
# Package the environment for spark.yarn.dist.archives.
$ zip -r r_env.zip r_env
Then, in R:

library(sparklyr)

config <- spark_config()
# Point Spark at the Rscript inside the distributed archive.
config[["spark.r.command"]] <- "./r_env.zip/r_env/bin/Rscript"
# Ship the zipped conda environment to the YARN containers.
config[["spark.yarn.dist.archives"]] <- "r_env.zip"
# Set the R environment variables for the worker-side Rscript.
config$sparklyr.apply.env.R_HOME <- "./r_env.zip/r_env/lib/R"
config$sparklyr.apply.env.RHOME <- "./r_env.zip/r_env"
config$sparklyr.apply.env.R_SHARE_DIR <- "./r_env.zip/r_env/lib/R/share"
config$sparklyr.apply.env.R_INCLUDE_DIR <- "./r_env.zip/r_env/lib/R/include"

sc <- spark_connect(master = "yarn-client", config = config)

sdf_len(sc, 5, repartition = 1) %>%
  spark_apply(function(e) I(e))
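
A quick sanity check, as a sketch: return the environment as seen by the workers to confirm the sparklyr.apply.env.* values took effect.

# Sketch: each worker reports the R_HOME it actually sees.
sdf_len(sc, 1, repartition = 1) %>%
  spark_apply(function(e) data.frame(r_home = Sys.getenv("R_HOME")))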

@chezou (Contributor, Author) commented Aug 19, 2017

Running the broom package works fine with the packages = FALSE option; otherwise it hits a libicui18n.so dependency conflict. See chezou/cloudera-parcel#2.

iris_tbl <- sdf_copy_to(sc, iris)

# Fit a per-Species linear model on the workers and tidy the result;
# packages = FALSE avoids copying local packages to the workers.
spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Length ~ Petal_Width, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species",
  packages = FALSE)
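
Note that with packages = FALSE, sparklyr does not copy the driver's local packages to the workers, so broom has to be present in the distributed conda environment already. A sketch of a check, assuming nothing beyond base R on the workers:

# Sketch: verify broom is loadable inside the distributed environment.
sdf_len(sc, 1, repartition = 1) %>%
  spark_apply(function(e) data.frame(has_broom = requireNamespace("broom", quietly = TRUE)))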
