
Arrow #1457

Closed
javierluraschi opened this issue Apr 30, 2018 · 8 comments

@javierluraschi
Collaborator

javierluraschi commented Apr 30, 2018

Related

@mattpollock
Contributor

Will this work apply both to spark_apply and to an R version of toPandas? My (limited) understanding is that toPandas uses Arrow, but that the Python version of collect does basically the same thing as the R version does.

@javierluraschi
Collaborator Author

@mattpollock yes, my intention is to replace all serialization between R and Scala with Arrow; serialization is used in copy_to(), collect(), and spark_apply().
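
For reference, here is a minimal sketch of the three code paths mentioned above, assuming (as sparklyr 1.0 later documented) that simply attaching the arrow package is what switches serialization over to Arrow:

```r
# Sketch only: the three sparklyr operations whose R <-> JVM serialization
# this issue proposes to move onto Arrow. Assumes attaching the arrow
# package before connecting is what enables the Arrow path.
library(arrow)     # assumption: attaching arrow enables Arrow serialization
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# 1. copy_to(): R data frame -> Spark DataFrame
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# 2. collect(): Spark DataFrame -> R data frame
avg_mpg <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# 3. spark_apply(): ship an R closure to each partition, serializing rows both ways
row_counts <- spark_apply(mtcars_tbl, function(df) data.frame(n = nrow(df)))

spark_disconnect(sc)
```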

@russellpierce
Contributor

@javierluraschi Is the expectation then that copy_to() would become suitable for bulk data transfer, or would the best pattern still be to get the data into a disk storage format for Spark?

@javierluraschi javierluraschi moved this from Wishlist to In Progress in SparklyBoard Jul 16, 2018
@javierluraschi
Collaborator Author

@russellpierce copy_to() will never be as efficient as dedicated tools for loading data into a cluster; however, copy_to() currently has narrow application since it can't really handle even "medium"-sized data (a few megabytes seems to be the limit). So my hope is that, among the many improvements introduced by a single data representation with Arrow, we can improve performance and support higher loads. I'm planning to spend this week on investigation and should be able to open a WIP PR to share progress and take feedback.
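
As a rough illustration of the two loading patterns being contrasted here (the parquet path and the upstream write step are hypothetical):

```r
# Sketch of the two loading patterns discussed above.
library(sparklyr)

sc <- spark_connect(master = "local")

# Small data: push an in-memory R data frame through copy_to()
# (the code path this issue aims to speed up with Arrow).
small_tbl <- copy_to(sc, mtcars, "mtcars_small", overwrite = TRUE)

# Bulk data: land it in a storage format the cluster can read directly,
# then register it without routing every row through the R process.
# e.g. arrow::write_parquet(big_df, "/shared/storage/big_df.parquet")
big_tbl <- spark_read_parquet(sc, name = "big_df",
                              path = "/shared/storage/big_df.parquet")

spark_disconnect(sc)
```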

@randomgambit

@javierluraschi That looks really amazing. Will this be shipped with sparklyr 0.9?

@javierluraschi javierluraschi moved this from In Progress to Done in SparklyBoard Nov 20, 2018
@awblocker
Contributor

Wanted to check in on the timeline for Arrow serialization for spark_apply. Glad to help with testing or other open items. My team has a set of workloads (far more groups than executors) where this looks like it will have a very large impact.
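
For context, a hedged sketch of the shape of such a workload: one R function applied per group via spark_apply(), where the per-group serialization cost is what Arrow would reduce. The dataset, grouping column, and model are purely illustrative:

```r
# Illustrative only: a "many more groups than executors" spark_apply workload.
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# One R model fit per group; with thousands of groups, the cost of
# serializing each group's rows to and from R is what Arrow is meant to shrink.
fits <- spark_apply(
  iris_tbl,
  function(df) {
    fit <- lm(Petal_Length ~ Petal_Width, data = df)
    data.frame(intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
  },
  group_by = "Species"
)
```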

@javierluraschi
Collaborator Author

@awblocker Arrow support is already implemented; however, we are waiting for the Arrow 0.12 release to make this a bit more official. For now, you can test this out using the instructions from:

#1611

@skattoor

skattoor commented Jul 3, 2019

After reading the various issues related to this, I'm not quite sure what the current status is: I've installed sparklyr v1.0.1.9004 along with Arrow v0.13.0, and spark_apply still shows the described slowness.

Is there a version I should upgrade to in order to solve this issue?

Any help much appreciated!
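
Not an answer from the thread, but one quick sanity check, under the assumption that sparklyr only switches to Arrow serialization when the arrow package is attached before the work runs, is to time the same spark_apply() call with and without library(arrow) in fresh sessions:

```r
# Assumption: Arrow is only used when the arrow package is attached before
# the work runs. Compare this timing against the same code in a fresh
# session without library(arrow); identical timings suggest Arrow is not
# being picked up.
library(arrow)      # comment this out in the comparison run
library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- sdf_len(sc, 1e5)

system.time(
  sdf_nrow(spark_apply(sdf, function(df) df))
)

spark_disconnect(sc)
```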
