Kubernetes #1525

Closed
javierluraschi opened this issue May 28, 2018 · 5 comments

Comments

@javierluraschi
Member

commented May 28, 2018

One can launch Spark in Kubernetes from sparklyr as:

sc <- spark_connect(
  master = "k8s://http://127.0.0.1:8001",
  config = list(
    spark.executor.instances = 2,
    spark.kubernetes.container.image = "spark-image"
  )
)

However, connectivity to Kubernetes would be blocked since sparklyr would not be able to find the application master.

See also https://spark.apache.org/docs/latest/running-on-kubernetes.html

@javierluraschi javierluraschi created this issue from a note in SparklyBoard (Wishlist) May 28, 2018

@javierluraschi javierluraschi self-assigned this Jun 12, 2018

@javierluraschi

Member Author

commented Jul 13, 2018

Here is some initial ongoing investigation to get this working, not ready for consumption...

In sparklyr you can run something similar to the Spark-on-Kubernetes launch to start a Spark Kubernetes cluster; afterwards, sparklyr needs to use kubectl to find out which node is the driver and then connect to it. Once connected, sparklyr should work the same way as with any other cluster manager: Yarn, Mesos, etc.
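
The driver lookup described above can be sketched with kubectl. This is only an illustration under stated assumptions, not how sparklyr actually implements it: the function name driver_pod_ip is hypothetical, the default pod name spark-pi-driver is borrowed from the spark.kubernetes.driver.pod.name setting used in the spark_connect() call in this comment, and the pod is assumed to run in the default namespace.

```shell
# Hypothetical helper (not part of sparklyr): print the IP of the Spark driver pod.
# Assumes kubectl is on PATH and already points at the right cluster.
driver_pod_ip() {
  pod_name="${1:-spark-pi-driver}"   # assumed driver pod name
  namespace="${2:-default}"          # assumed namespace
  # Ask the API server for the pod and extract only its IP address.
  kubectl get pod "$pod_name" \
    --namespace "$namespace" \
    --output jsonpath='{.status.podIP}'
}
```

Calling driver_pod_ip with no arguments queries spark-pi-driver in the default namespace; pass a pod name and namespace to override either.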

To use Kubernetes locally:

  • Install kubernetes-cli and minikube; on OS X you can run: brew install kubernetes-cli minikube.
  • Run minikube start as described in the quickstart.
  • Get the master address by running kubectl cluster-info.
  • Create a sparklyr folder in SPARK_HOME and copy the sparklyr jars from ``.
  • Modify the Spark kubernetes/dockerfiles/spark/Dockerfile to create a /opt/sparklyr path and copy the sparklyr jars. For instance by adding mkdir -p /opt/sparklyr && \ and COPY sparklyr /opt/sparklyr.
  • To prevent the error Forbidden!Configured service account doesn't have access, create a service account and a role binding for it: kubectl create serviceaccount spark and kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default.
  • Generate the cluster docker images by running from SPARK_HOME: ./bin/docker-image-tool.sh -m -t sparklyr build. However, you might hit an error with a workaround available in this post.
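
As a rough sketch of how the master address from kubectl cluster-info maps onto the k8s:// master URL used when connecting, the helper below reads the cluster-info text on stdin and prints a Spark master URL. The function name spark_master_from_cluster_info is hypothetical, and it assumes that the first URL in the output is the Kubernetes API server, which may not hold for every kubectl version.

```shell
# Hypothetical helper: turn `kubectl cluster-info` output into a Spark master URL.
# Reads the cluster-info text on stdin and prints k8s://<api-server-url>.
# Assumption: the first URL in the output is the Kubernetes API server.
spark_master_from_cluster_info() {
  esc=$(printf '\033')
  # Strip ANSI color codes, then keep the first http(s) URL found.
  url=$(sed "s/${esc}\[[0-9;]*m//g" | grep -Eo 'https?://[^[:space:]]+' | head -n 1)
  # Fail if no URL was found rather than printing a bare prefix.
  [ -n "$url" ] || return 1
  printf 'k8s://%s\n' "$url"
}
```

It would typically be used as kubectl cluster-info | spark_master_from_cluster_info, and the printed value passed as the master argument of spark_connect().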

Launch sparklyr as follows:

sc <- spark_connect(
  master = "k8s://https://192.168.99.100:8443",
  config = list(
    "sparklyr.shell.master" = "k8s://https://192.168.99.100:8443",
    "sparklyr.shell.deploy-mode" = "cluster",
    "sparklyr.gateway.remote" = TRUE,
    "sparklyr.shell.name" = "sparklyr",
    "sparklyr.shell.class" = "sparklyr.Shell",
    "sparklyr.shell.conf" = c(
      "spark.kubernetes.container.image=spark:sparklyr",
      "spark.kubernetes.driver.pod.name=spark-pi-driver",
      "spark.kubernetes.authenticate.driver.serviceAccountName=spark"
    ),
    "sparklyr.app.jar" = "local:///opt/sparklyr/sparklyr-2.3-2.11.jar"
  ),
  spark_home = spark_home_dir()
)


@javierluraschi javierluraschi moved this from Wishlist to In Progress in SparklyBoard Jul 14, 2018

@javierluraschi javierluraschi referenced this issue Jul 14, 2018
3 of 3 tasks complete

@javierluraschi javierluraschi moved this from In Progress to Done in SparklyBoard Jul 16, 2018

@kevinykuo kevinykuo added this to the 0.9.0 milestone Jul 18, 2018

@javierluraschi

Member Author

commented Jul 18, 2018

Implemented with #1599

@kkeenan02


commented Sep 13, 2018

@javierluraschi Thank you for making the effort to build kubernetes support into sparklyr.

I'm trying to understand how to implement this for myself, and am hoping you could shed a bit more light, please. I am trying to connect to a remote spark master on the same kubernetes cluster as my rstudio-server instance (with sparklyr installed), with the aim of being able to execute sparklyr commands against the spark cluster from within rstudio. I have added the sparklyr jars to the spark container, but when I try to run the spark_connect code above (substituting my local urls etc.), I get the following error

Error in shell_connection(master = master, spark_home = spark_home, app_name = app_name,  : 
  Failed to connect to Spark (SPARK_HOME is not set).

Am I misunderstanding something fundamental here? For example, do the Spark master and the RStudio instance have to co-exist in the same container?

Any insight you can provide would be greatly appreciated.

Kevin

@javierluraschi

Member Author

commented Oct 1, 2018

@kkeenan02 this work was only to enable connection to new Kubernetes clusters... if you have an existing Kubernetes cluster, you probably don't even need this work, let me elaborate.

One option for using Kubernetes is for someone, usually your system admin or someone on your team, to start a Spark cluster; Kubernetes can be used to start a team cluster running Yarn, Cloudera, or whatever else. Then you can connect to this cluster from sparklyr as "usual". The recommended approach is to also start RStudio Server while creating the Kubernetes cluster so that, once the cluster starts, you can navigate to a URL with RStudio Server and connect using sc <- spark_connect(master = "local"). I'm assuming Yarn is running inside Kubernetes, but Spark standalone or other options could be in place. For this use case, you don't even need the work in sparklyr 0.9. From what you mention, it sounds like you might be in this situation.

The other option, which this PR enables in sparklyr 0.9, is for cases where a generic Kubernetes cluster is provided (read: a bunch of machines without anything installed). In these cases, you can create your own VM image for the cluster; connecting with spark_connect() then triggers the creation of Spark inside the cluster by initializing the right VMs with the right image and then establishing the connection. To be honest, this is a pretty advanced use case, so again, my guess is that you are in the former use case and don't need to worry about where the cluster is actually running.

@alokgogate


commented Apr 16, 2019

@javierluraschi, I was trying the other option that you mentioned, using sparklyr 1.0.0 with Spark 2.4.0 running on Kubernetes, and I still face the issue that @kkeenan02 has.
Here is my code so far:

config <- spark_config_kubernetes(master="k8s://192.168.1.1:6443",
                              config = list(
                                spark.submit.deployMode = "client",
                                spark.kubernetes.namespace = "spark-project1",
                                spark.driver.host = "192.168.1.1",
                                spark.driver.port = 7787,
                                spark.kubernetes.container.image = "http://192.168.1.1:5000/spark/spark-r",
                                spark.executor.instances = 2,
                                spark.kubernetes.driver.request.cores = 32,
                                spark.kubernetes.driver.limit.cores = 32,
                                spark.kubernetes.executor.request.cores = 12,
                                spark.kubernetes.executor.limit.cores = 13,
                                spark.executor.core = 12,
                                spark.executor.memory = "20G",
                                spark.kubernetes.authenticate.driver.serviceAccountName = "spark"
                              ))
 

sc <- spark_connect(config = config)

Any suggestions as to what's wrong here ?
