
Arrow #1457

Closed
javierluraschi opened this issue Apr 30, 2018 · 8 comments

@javierluraschi
Collaborator

javierluraschi commented Apr 30, 2018

Related

@mattpollock
Contributor

Will this work apply both to spark_apply and to an R version of toPandas? My (limited) understanding is that toPandas uses Arrow, but that the Python version of collect does basically the same thing as the R version does.

@javierluraschi
Collaborator Author

@mattpollock yes, my intention is to replace all serialization between R and Scala with Arrow; serialization is used in copy_to(), collect(), and spark_apply().
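
For reference, here is a minimal sketch of the three code paths mentioned above, assuming (as sparklyr 1.0 later documented) that simply attaching the arrow package is what switches serialization over to Arrow:

```r
# Sketch only: the three sparklyr operations whose R <-> JVM serialization
# this issue proposes to move onto Arrow. Assumes attaching the arrow
# package before connecting is what enables the Arrow path.
library(arrow)     # assumption: attaching arrow enables Arrow serialization
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# 1. copy_to(): R data frame -> Spark DataFrame
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# 2. collect(): Spark DataFrame -> R data frame
avg_mpg <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# 3. spark_apply(): ship an R closure to each partition, serializing rows both ways
row_counts <- spark_apply(mtcars_tbl, function(df) data.frame(n = nrow(df)))

spark_disconnect(sc)
```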

@russellpierce
Contributor

@javierluraschi Is the expectation then that copy_to() would become suitable for bulk data transfer, or would the best pattern still be to get the data into a disk storage format for Spark?

@javierluraschi javierluraschi moved this from Wishlist to In Progress in SparklyBoard Jul 16, 2018
@javierluraschi
Collaborator Author

@russellpierce copy_to() will never be as efficient as dedicated tools for loading data into a cluster; however, copy_to() currently has narrow application since it can't really handle even "medium"-sized data (a few megabytes seems to be the limit). So my hope is that, among the many improvements introduced by a single data representation with Arrow, we can improve performance and support higher loads. I'm planning to spend this week on investigation and should be able to open a WIP PR to share progress and take feedback.
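
As a rough illustration of the two loading patterns being contrasted here (the parquet path and the upstream write step are hypothetical):

```r
# Sketch of the two loading patterns discussed above.
library(sparklyr)

sc <- spark_connect(master = "local")

# Small data: push an in-memory R data frame through copy_to()
# (the code path this issue aims to speed up with Arrow).
small_tbl <- copy_to(sc, mtcars, "mtcars_small", overwrite = TRUE)

# Bulk data: land it in a storage format the cluster can read directly,
# then register it without routing every row through the R process.
# e.g. arrow::write_parquet(big_df, "/shared/storage/big_df.parquet")
big_tbl <- spark_read_parquet(sc, name = "big_df",
                              path = "/shared/storage/big_df.parquet")

spark_disconnect(sc)
```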

@randomgambit

@javierluraschi That looks really amazing. Will this be shipped with sparklyr 0.9?

@javierluraschi javierluraschi moved this from In Progress to Done in SparklyBoard Nov 20, 2018
@awblocker
Contributor

Wanted to check in on the timeline for Arrow serialization for spark_apply. Glad to help with testing or other open items. My team has a set of workloads (far more groups than executors) where this looks like it will have a very large impact.
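
For context, a hedged sketch of the shape of such a workload: one R function applied per group via spark_apply(), where the per-group serialization cost is what Arrow would reduce. The dataset, grouping column, and model are purely illustrative:

```r
# Illustrative only: a "many more groups than executors" spark_apply workload.
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# One R model fit per group; with thousands of groups, the cost of
# serializing each group's rows to and from R is what Arrow is meant to shrink.
fits <- spark_apply(
  iris_tbl,
  function(df) {
    fit <- lm(Petal_Length ~ Petal_Width, data = df)
    data.frame(intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
  },
  group_by = "Species"
)
```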

@javierluraschi
Collaborator Author

@awblocker Arrow support is already implemented; however, we are waiting for the Arrow 0.12 release to make this a bit more official. For now, you can test this out using the instructions from:

#1611

@skattoor

skattoor commented Jul 3, 2019

After reading the various issues related to this, I'm not quite sure what the current status is: I've installed sparklyr v1.0.1.9004 along with Arrow v0.13.0, and spark_apply still shows the described slowness.

Is there a version I should upgrade to in order to solve this issue?

Any help much appreciated!
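
Not an answer from the thread, but one quick sanity check, under the assumption that sparklyr only switches to Arrow serialization when the arrow package is attached before the work runs, is to time the same spark_apply() call with and without library(arrow) in fresh sessions:

```r
# Assumption: Arrow is only used when the arrow package is attached before
# the work runs. Compare this timing against the same code in a fresh
# session without library(arrow); identical timings suggest Arrow is not
# being picked up.
library(arrow)      # comment this out in the comparison run
library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- sdf_len(sc, 1e5)

system.time(
  sdf_nrow(spark_apply(sdf, function(df) df))
)

spark_disconnect(sc)
```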
