sparklyr and SparkR - the future? #502

Closed
kevinykuo opened this Issue Feb 18, 2017 · 2 comments

3 participants
@kevinykuo
Collaborator

kevinykuo commented Feb 18, 2017

Just checked out @nwstephens's talk at Spark Summit East where he answers some questions regarding sparklyr vs. SparkR towards the end. I've been thinking about this and am just a little bit concerned about how things might play out.

While I'm a fan of having different ways to solve a problem, sometimes it gets in the way of collaboration. As an example, most of the work I've done involves data manipulation/transformation, and it's been difficult for dplyr and data.table "speakers" to work together on the same task. What I'm worried about is that we'll have a "sparklyr camp" and a "SparkR camp" come next year, and we'll further factionalize the community and in turn discourage new data scientists from picking up R.

I'd be interested to hear thoughts from the main contributors, along with the results of any discussions with the SparkR folks.

@kevinushey

Contributor

kevinushey commented Feb 21, 2017

Some of the main reasons we decided to write sparklyr rather than contribute to SparkR:

  1. SparkR, rather than implementing the dplyr interface (e.g. defining S3 methods for mutate() and friends), chooses to override / mask those functions with its own equivalents. We felt that sticking with S3 and implementing methods for mutate() and friends fits better with the dplyr / tidyverse philosophy.

  2. A number of SparkR routines require supporting code in the main Spark Scala sources (which couples SparkR's code base a bit more tightly to Spark's internals than ours is); we wanted to make sure that sparklyr could work with Scala extensions to Spark that live independently of Spark's core.

  3. SparkR (at least at the time when we first looked at it) wanted to tie the development + release of their package with certain versions of Spark; e.g. SparkR v1.6 was needed for Spark 1.6; SparkR v2.0 was required for Spark 2.0, and so on. We wanted sparklyr to 'just work' regardless of what version(s) of Spark you had installed.

  4. We wanted to make it easy for users to define Spark extensions (using Scala code) and release those extensions as part of R packages.
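
To make point 1 concrete, here is a minimal base R sketch of the difference between adding an S3 method to an existing generic and masking the function outright. The generic `summarise_tbl()` and the class `tbl_remote` are hypothetical names invented for this illustration; they are not part of dplyr, SparkR, or sparklyr, and the masking case is only simulated with `local()`.

```r
# Hypothetical generic standing in for a dplyr verb such as mutate().
summarise_tbl <- function(.data, ...) UseMethod("summarise_tbl")

# Method for ordinary local data frames.
summarise_tbl.data.frame <- function(.data, ...) {
  "handled by the data.frame method"
}

# S3 approach: a backend adds a method for its own class ("tbl_remote" is
# a made-up class name). The generic itself is untouched, so existing
# data.frame code keeps working alongside the new backend.
summarise_tbl.tbl_remote <- function(.data, ...) {
  "handled by the remote backend method"
}

local_tbl  <- data.frame(x = 1:3)
remote_tbl <- structure(list(), class = "tbl_remote")

s3_local  <- summarise_tbl(local_tbl)   # dispatches on "data.frame"
s3_remote <- summarise_tbl(remote_tbl)  # dispatches on "tbl_remote"

# Masking approach (simulated): a later definition with the same name
# shadows the generic, so every call in this scope reaches the new
# function no matter what class the input has.
masked_local <- local({
  summarise_tbl <- function(.data, ...) "handled by the masking function"
  summarise_tbl(local_tbl)
})
```

With the S3 approach both table types coexist under one function name, whereas in the masking case the local data frame no longer reaches its original method.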

All in all, we wanted sparklyr to play well with other packages in the R ecosystem, and also provide a bit more freedom to users to extend Spark (and call into the Spark Scala API) as desired.

That said, we certainly haven't reached that goal yet (there are things SparkR does better than sparklyr currently; for example, parallel execution of R code across Spark nodes) but we hope to get there in the future. And while you won't be able to (easily) write code that uses SparkR and sparklyr at the same time, there's nothing stopping users from using them independently to mutate datasets in the same data store, so I think it's still an overall net win for users.

@kevinykuo kevinykuo closed this May 10, 2017

@javierluraschi

Member

javierluraschi commented Nov 20, 2018

Follow up regarding:

"That said, we certainly haven't reached that goal yet (there are things SparkR does better than sparklyr currently; for example, parallel execution of R code across Spark nodes) but we hope to get there in the future."

This is no longer the case: later in 2017 we introduced spark_apply() in sparklyr 0.6. Late-2018 development work on Arrow support shows spark_apply() to be orders of magnitude faster than the SparkR variations, at least in the measurements we gathered. At this point I'm not aware of any features sparklyr is still missing. If there are features we still need to consider, please open feature requests at https://github.com/rstudio/sparklyr/issues/.
