Databricks 13 support #3334

Closed
RobinLoche opened this issue May 12, 2023 · 6 comments
Labels
connection databricks Issues related to Databricks connection mode high-pri

Comments

@RobinLoche

Starting with Databricks DBR 13 and the associated Python package databricks-connect 13.0.0, the connection now uses the new Spark Connect.

This seems to change a few things in how the connection works, and sparklyr doesn't seem to be compatible.

For example, if I connect the traditional way, which works with older versions, I get this:

> sc <- spark_connect(method = "databricks", spark_home = system("databricks-connect get-spark-home", intern = TRUE))
Error in system2(file.path(spark_home, "bin", "spark-submit"), "--version",  : 
  error in running command

I tested this connection and it works in Python. My guess is that this is a big change in how the connection works internally, and that it is not compatible with sparklyr (I have version 1.8.1).

I expected that, since it's brand new, but my question is: is support for this new databricks-connect planned for sparklyr? Or do I have to switch to ODBC (which lacks features compared to sparklyr)?
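The error above is raised when `bin/spark-submit` under `spark_home` cannot be run. A minimal diagnostic sketch, in Python, that only checks whether that script exists under a given Spark home is shown below. The helper name `find_spark_submit` is hypothetical, and the idea that the script is simply missing is an assumption that this thread does not confirm.

```python
import os
from pathlib import Path


def find_spark_submit(spark_home):
    """Return the path to bin/spark-submit under spark_home, or None if absent.

    Hypothetical helper for illustration only; it is not part of sparklyr
    or databricks-connect.
    """
    if not spark_home:
        return None
    bin_dir = Path(spark_home) / "bin"
    # Check the Unix script first, then the Windows wrapper.
    for name in ("spark-submit", "spark-submit.cmd"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: SPARK_HOME holds the path you would pass as spark_home.
    home = os.environ.get("SPARK_HOME", "")
    found = find_spark_submit(home)
    print(found if found else "spark-submit not found under: " + repr(home))
```

If the helper returns `None` for the path being passed as `spark_home`, a call like the `system2()` one in the error message would have nothing to execute.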

@gregleleu
Contributor

Had a look at Spark Connect.
From what I understand, the client session needs to be built using:

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

(https://spark.apache.org/docs/latest/spark-connect-overview.html)

I couldn't get it to work: "remote" is not found as a member of builder. Apparently org.apache.spark:spark-connect-client-jvm_2.12:3.4.0 needs to be imported beforehand to add that method, but I don't know how that works. Is that something the sparklyr jar needs to do?
(https://stackoverflow.com/questions/76209004/using-spark-connect-with-scala)
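For comparison, here is a minimal sketch of the same session setup from the Python client, where `SparkSession.builder.remote` is available from PySpark 3.4 onward. The helper names `spark_connect_url` and `connect` are hypothetical. The port 15002 is the default Spark Connect port, and the sketch assumes a Spark Connect server is already listening there.

```python
def spark_connect_url(host, port=15002):
    """Build a Spark Connect URL such as sc://localhost:15002.

    Hypothetical helper; 15002 is the default Spark Connect server port.
    """
    return f"sc://{host}:{port}"


def connect(url):
    """Open a Spark Connect session against url.

    Assumes PySpark 3.4+ with the Connect dependencies installed and a
    Spark Connect server reachable at url. The import is deferred so the
    URL helper can be used without PySpark installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


if __name__ == "__main__":
    url = spark_connect_url("localhost")
    print(url)
    # Uncomment once a Spark Connect server is running:
    # spark = connect(url)
    # spark.range(5).show()
```

This only shows the Python side. It does not address the JVM client issue described above, where `remote` is missing from the Scala builder.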

More generally, this creates a local Spark, which talks to a remote one. Wouldn't the point be for the R session to talk directly to the remote Spark? Do we need Spark Connect for that?
R <--sparklyr--> Local Spark <--spark connect--> Remote Spark
vs
R <--sparklyr--> Remote Spark

@edgararuiz
Collaborator

Hi all, this is something that I'm working on right now. The bottom line is that, at this time, sparklyr does not work with Spark Connect or Databricks Connect.

@edgararuiz
Collaborator

Hi all, yesterday I merged the changes needed in sparklyr to support a new extension package, called pysparklyr, that will enable integration with Spark Connect and Databricks Connect. Instructions on how to use it are in the README of the new package: https://github.com/mlverse/pysparklyr . Please take into account that MLlib is not implemented for Connect in version 3.4, so we are limited to SQL-based commands at this time. Once ML capabilities are added in 3.5, we'll start working on adding those wrappers.

@edgararuiz
Collaborator

For reference, I have a tracking document in the repo that lists what is supported and what is not. I'd be curious to know whether there's anything there, not yet supported by pysparklyr, that you would like to see implemented sooner. Of course, aside from functionality that Connect does not support yet (ML, SDF) :) https://github.com/mlverse/pysparklyr/blob/main/progress.md

@edgararuiz
Collaborator

There is now a guide to using sparklyr with Spark Connect / Databricks Connect: https://spark.rstudio.com/deployment/databricks-spark-connect

@edgararuiz
Collaborator

Closing this issue, since the new solution, pysparklyr, is now in its 4th release on CRAN.
