Implement Arrow #1611
huzzah!
@javierluraschi would you be interested in doing a write up for the Apache Arrow blog about this work, including all the benchmark results?
@wesm yes, for sure. However, I don't consider this work complete, mostly due to arrow_data.R#L21: I'm currently turning off arrow for the unsupported data types. We have dates almost figured out, but nested data is also missing. I'm also investigating larger copy/collect use cases by tweaking batches. So we could write a "preliminary results" post on your blog mentioning these caveats and the current state of this work, wait until we push everything to CRAN, which is probably a couple of months away, or do both posts. What's your take?
I recommend a blog post much sooner as a means of also drumming up community involvement.
@wesm Makes sense. How do I send you a blog post?
You can do it as a pull request to the site/ directory in the Arrow repo.
@wesm here is a draft post: apache/arrow#3001
Support for Apache Arrow in sparklyr.
Benchmarks
For completeness, adding sparkR, which gets initialized as:
Copying
Collecting
Running this benchmark with 10^6 entries shows improvements under arrow.
spark_apply()
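The benchmark code itself did not survive on this page. As a rough sketch only, and not the code used in this PR, a comparison of this kind could be timed as below. It assumes a local Spark installation, that attaching the arrow package is what switches sparklyr over to Arrow serialization (as in released versions of sparklyr), and a hypothetical helper named time_apply(). The function passed to spark_apply() is an arbitrary placeholder.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Hypothetical helper: elapsed seconds for one spark_apply() pass over `rows` rows.
time_apply <- function(rows) {
  system.time(
    sdf_len(sc, rows) %>%        # Spark DataFrame with a single `id` column
      spark_apply(~ .x / 2) %>%  # placeholder R function run on the workers
      count() %>%
      collect()
  )[["elapsed"]]
}

# Baseline: arrow is not attached, so sparklyr uses its own serializer.
without_arrow <- time_apply(10^6)

# Assumption: attaching arrow enables Arrow-based serialization in sparklyr.
library(arrow)
with_arrow <- time_apply(10^6)

print(c(without_arrow = without_arrow, with_arrow = with_arrow))

spark_disconnect(sc)
```

A single system.time() call is noisy; repeating each measurement several times (for example with the bench or microbenchmark packages) would give more reliable numbers. The baseline run may also be slow at 10^6 rows, so a smaller row count may be more practical there.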
Notice that JIT was turned off since it adds a bit of overhead in spark_apply() for this particular example. Here is a detailed comparison between JIT enabled/disabled with arrow:
Here is a profile measuring time spent while running spark_apply(). Loading arrow seems to take 260ms, which could be worth investigating further at some point:
Comparing with scala:
Tests
From the Travis run performance results, we can compare execution against arrow as follows:
Overall, tests run with arrow execute faster than those using the sparklyr serializer. Travis tests use only small datasets, but they help ensure unnecessary overhead is not being introduced.