
Implement Arrow #1611

Merged: 132 commits into master on Nov 2, 2018
Conversation

@javierluraschi (Collaborator) commented on Jul 20, 2018:

Support for Apache Arrow in sparklyr.

# Install this PR
devtools::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
devtools::install_github("rstudio/sparklyr")

# Initialize Data
df <- data.frame(y = runif(10^5, 0, 1))

# Initialize sparklyr
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.1")

Benchmarks

For completeness, SparkR is also included in the benchmarks; it is initialized as follows:

# Initialize SparkR
Sys.setenv(SPARK_HOME = sparklyr::spark_home_dir("2.3.1"))
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sess <- sparkR.session(master = "local[*]")

Copying

copy_benchmark <- microbenchmark::microbenchmark(
    arrow = {
        library(arrow)
        sparklyr_df <<- dplyr::copy_to(sc, df, memory = TRUE, overwrite = TRUE)
        dplyr::count(sparklyr_df)
    },
    sparklyr = {
        if ("arrow" %in% .packages()) detach("package:arrow")
        sparklyr_df <<- dplyr::copy_to(sc, df, memory = TRUE, overwrite = TRUE)
        dplyr::count(sparklyr_df)
    },
    sparkr = {
        sparkr_df <<- SparkR::cache(SparkR::as.DataFrame(df))
        SparkR::count(sparkr_df)
    },
    times = 10
)

ggplot2::autoplot(copy_benchmark)

copy-benchmark

Collecting

collect_benchmark <- microbenchmark::microbenchmark(
    arrow = {
        library(arrow)
        sparklyr_local <<- dplyr::collect(sparklyr_df)
    },
    sparklyr = {
        if ("arrow" %in% .packages()) detach("package:arrow")
        sparklyr_local <<- dplyr::collect(sparklyr_df)
    },
    sparkr = {
        sparkr_local <<- SparkR::collect(sparkr_df)
    },
    times = 10
)

ggplot2::autoplot(collect_benchmark)

collect-benchmark

Running the collect benchmark with 10^6 rows shows clear improvements with arrow:

df_large <- data.frame(y = runif(10^6, 0, 1))
sparklyr_large <<- dplyr::copy_to(sc, df_large, memory = TRUE, overwrite = TRUE)

collect_large_benchmark <- microbenchmark::microbenchmark(
    arrow = {
        library(arrow)
        sparklyr_local <<- dplyr::collect(sparklyr_large)
    },
    sparklyr = {
        if ("arrow" %in% .packages()) detach("package:arrow")
        sparklyr_local <<- dplyr::collect(sparklyr_large)
    },
    times = 10
)

ggplot2::autoplot(collect_large_benchmark)

collect-large-benchmark

spark_apply()

r_benchmark <- microbenchmark::microbenchmark(
    arrow = {
        library(arrow)
        spark_apply(sparklyr_df, ~ .x / 1.2, columns = list(x = "numeric"),
                    env = list(R_ENABLE_JIT = "0"), memory = FALSE) %>%
            dplyr::count() %>%
            dplyr::collect()
    },
    sparklyr = {
        if ("arrow" %in% .packages()) detach("package:arrow")
        spark_apply(sparklyr_df, ~ .x / 1.2, columns = list(x = "numeric"),
                    memory = FALSE) %>%
            dplyr::count() %>%
            dplyr::collect()
    },
    sparkr = {
        dapply(sparkr_df, function(x) x / 1.2, structType(structField("y", "double"))) %>% SparkR::count()
    },
    times = 10, control = list(order = "block")
)

ggplot2::autoplot(r_benchmark)

r-benchmark

Note that the JIT was turned off, since it adds a bit of overhead to spark_apply() in this particular example. Here is a detailed comparison of arrow with the JIT enabled and disabled:

jit_benchmark <- microbenchmark::microbenchmark(
    jit_off = {
        library(arrow)
        spark_apply(sparklyr_df, ~ .x / 1.2, columns = list(x = "numeric"),
                    env = list(R_ENABLE_JIT = "0"), memory = FALSE) %>%
            dplyr::count() %>%
            dplyr::collect()
    },
    jit_on = {
        library(arrow)
        spark_apply(sparklyr_df, ~ .x / 1.2, columns = list(x = "numeric"),
                    memory = FALSE) %>%
            dplyr::count() %>%
            dplyr::collect()
    },
    times = 10
)

ggplot2::autoplot(jit_benchmark)

jit-benchmark

Here is a profile measuring where time is spent while running spark_apply(); loading arrow appears to take about 260 ms, which could be worth investigating further at some point:

[screenshot: profile of time spent while running spark_apply()]

For comparison, the equivalent operation in Scala:

def time[R](block: => R): R = {
    val t0 = System.currentTimeMillis()
    val result = block    // call-by-name
    val t1 = System.currentTimeMillis()
    println("Elapsed time: " + (t1 - t0) + "ms")
    result
}

val data = spark.range(1,100000,1).cache

time { data.map(_ / 1.2).count() }

Elapsed time: 174ms
res16: Long = 99999
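For reference, the Scala time helper above can be sketched in base R as well; the helper name time_it below is made up for this example, and no Spark connection is needed to run it:

```r
# A base-R analogue of the Scala time() helper above (a sketch; the
# name time_it is invented for this illustration).
time_it <- function(expr) {
  t0 <- Sys.time()
  result <- expr  # lazy argument: the expression is evaluated here
  t1 <- Sys.time()
  elapsed_ms <- round(as.numeric(difftime(t1, t0, units = "secs")) * 1000)
  cat("Elapsed time:", elapsed_ms, "ms\n")
  result
}

# Usage: time a simple vectorized computation comparable in size to
# the Scala example (no Spark required for the helper itself)
res <- time_it(sum((1:99999) / 1.2))
```

Because R evaluates function arguments lazily, the expression is only run when `expr` is first used inside the helper, mirroring Scala's call-by-name parameter.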

Tests

Using the performance results from the Travis runs, we can compare execution times with and without arrow as follows:

library(dplyr)

data_sparklyr <- read.csv("~/RStudio/temp/sparklyr-perf-tests-spark.txt") %>%
  pull() %>%
  stringr::str_match(., "(.*) ([0-9]+\\.?[0-9]*)") %>%
  as.data.frame() %>%
  transmute(test = trimws(V2), serializer = "sparklyr", time = as.numeric(trimws(V3)))

data_arrow <- read.csv("~/RStudio/temp/sparklyr-perf-tests-arrow.txt") %>%
  pull() %>%
  stringr::str_match(., "(.*) ([0-9]+\\.?[0-9]*)") %>%
  as.data.frame() %>%
  transmute(test = trimws(V2), serializer = "arrow", time = as.numeric(trimws(V3)))
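As a quick sanity check of the parsing pattern used above, here is a minimal base-R example on made-up sample lines (the sample lines are invented; only the regex matches the code above):

```r
# Hypothetical sample lines in the "test-name time" format parsed above
lines <- c("spark_apply basic 12.5", "copy_to large 3.75")

# Same pattern as the stringr::str_match() calls above, via base R:
# group 1 captures the test name, group 2 the trailing number
m <- regmatches(lines, regexec("(.*) ([0-9]+\\.?[0-9]*)", lines))
tests <- vapply(m, function(x) trimws(x[2]), character(1))
times <- vapply(m, function(x) as.numeric(trimws(x[3])), numeric(1))

tests  # "spark_apply basic" "copy_to large"
times  # 12.5 3.75
```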

library(ggplot2)
bind_rows(data_sparklyr, data_arrow) %>%
  ggplot(aes(x=test, y=time, fill = serializer)) +
    geom_bar(stat='identity', position='dodge') +
    theme(axis.text.x = element_blank(), axis.ticks.x=element_blank())

sparklyr-arrow-tests

bind_rows(data_sparklyr, data_arrow) %>%
  tidyr::spread(serializer, time) %>%
  summarise(arrow = sum(arrow), sparklyr = sum(sparklyr))
#>      arrow sparklyr
#> 1 1406.354 1475.308

Overall, the arrow tests execute faster than the default sparklyr serializer. The Travis tests use only small datasets, but they help ensure that unnecessary overhead is not being introduced.

@javierluraschi merged commit 78bbe0a into master on Nov 2, 2018
@wesm commented on Nov 11, 2018:

huzzah!

@wesm commented on Nov 11, 2018:

@javierluraschi would you be interested in doing a write up for the Apache Arrow blog about this work, including all the benchmark results?

@javierluraschi (Collaborator, Author) replied:
@wesm yes, for sure. However, I don't consider this work complete yet, mostly due to arrow_data.R#L21: I'm currently turning off arrow for the unsupported data types. Dates are almost figured out, but nested data is still missing. I'm also investigating larger copy/collect use cases by tweaking batch sizes.

So, we could write a "preliminary results" post on your blog mentioning these caveats and the current state of this work, or we could wait until we push everything to CRAN, which is probably a couple of months away, or do both posts.

What's your take?

@wesm commented on Nov 12, 2018:

I recommend a blog post much sooner, as a means of also drumming up community involvement.

@javierluraschi (Collaborator, Author) replied:
@wesm Makes sense. How do I send you a blog post?

@wesm commented on Nov 12, 2018:

You can do it as a pull request to the site/ directory in the Arrow repo.

@javierluraschi (Collaborator, Author) replied:

@wesm here is a draft post: apache/arrow#3001
