Simple/basic/limited/incomplete benchmark for dplyr and data.table
For n = 10M and 100M rows and m = 100, 10K and 1M groups, create the data.frames

```r
d  <- data.frame(x = sample(m, n, replace = TRUE), y = runif(n))
dm <- data.frame(x = sample(m))
```

and the corresponding data.tables with and without a key on `x` (`d`'s size in RAM is around 100MB and 1GB, respectively).
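The keyed and unkeyed data.table versions can be set up along these lines (a sketch with small sizes for illustration; `dt` and `dtm` match the snippets below, while `dtk` is my own name for the keyed copy):

```r
library(data.table)

# small sizes for illustration; the benchmark used n = 10M/100M, m = 100/10K/1M
n <- 1e5; m <- 100
d  <- data.frame(x = sample(m, n, replace = TRUE), y = runif(n))
dm <- data.frame(x = sample(m))

dt  <- as.data.table(d)   # data.table without key
dtk <- copy(dt)           # "dtk" is a hypothetical name, not from the post
setkey(dtk, x)            # key on x: pre-sorts the table in place by x
dtm <- as.data.table(dm)
setkey(dtm, x)            # keyed lookup table used for the join
```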
The basic tabular operations (filter, aggregate, join etc.) are applied using base, dplyr (with data.frame and data.table backends, with and without key for data.table) and standard data.table (with and without key).
This is just a simple/basic/limited/incomplete benchmark; one could do more with various data types (e.g. character), several grouping variables (x1, x2, ...), more values for the size parameters (n, m), different distributions of values in the data.frames etc. (or with real-world datasets).
| operation | base | dplyr | data.table |
|---|---|---|---|
| filter | `d[d$x>=10 & d$x<20,]` | `d %>% filter(x>=10, x<20)` | `dt[x>=10 & x<20]` |
| sort | `d[order(d$x),]` | `d %>% arrange(x)` | `dt[order(x)]` |
| new column | `d$y2 <- 2*d$y` | `d %>% mutate(y2 = 2*y)` | `dt[, y2 := 2*y]` |
| aggregate | `tapply(d$y, d$x, mean)` | `d %>% group_by(x) %>% summarize(ym = mean(y))` | `dt[, mean(y), by=x]` |
| join | `merge(d, dm, by="x")` | `d %>% inner_join(dm, by="x")` | `dt[dtm, nomatch=0]` |
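A minimal sketch of how such timings can be taken with `system.time` (the actual harness is in bm.Rmd in the repo; sizes and absolute times here are illustrative only):

```r
library(data.table)
library(dplyr)

# smaller sizes for illustration; the benchmark used n up to 100M, m up to 1M
n <- 1e6; m <- 1e4
d  <- data.frame(x = sample(m, n, replace = TRUE), y = runif(n))
dt <- as.data.table(d)

# elapsed time of one run of each implementation of the aggregate
t_base  <- system.time(tapply(d$y, d$x, mean))["elapsed"]
t_dplyr <- system.time(d %>% group_by(x) %>% summarize(ym = mean(y)))["elapsed"]
t_dt    <- system.time(dt[, mean(y), by = x])["elapsed"]

# relative running times, normalized to data.table (lower is better)
round(c(base = t_base, dplyr = t_dplyr, data.table = t_dt) / t_dt, 1)
```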
Full code is in bm.Rmd and the results for each n,m are in the bm-nxx-mxx.md files in the repo. The latest CRAN versions of R, dplyr and data.table were used (R 3.1.1, dplyr 0.3.0.2 and data.table 1.9.4).
A summary of results (relative running times, lower is better) is here (the larger numbers are usually for larger m, i.e. lots of small groups):
Having a key (which for data.table means having the data pre-sorted in place) obviously helps with sorting, aggregation and joins (depending on the use case, though, the time to generate the key should be added to the timing).
dplyr with the data.table backend/source is almost as fast as plain data.table (because in this case dplyr acts as a wrapper and calls data.table functions behind the scenes), so you can kind of have both: the dplyr API (my personal preference) and speed.
dplyr with a data.frame source is slower than data.table for sorting, aggregation and joins. Some of this apparently has to do with radix sort and binary-search joins (data.table) being faster than hash-table based joins (dplyr), as described here, but some of it is likely to be improved, as Hadley said here.
Defining a new column in data.table (or in dplyr with the data.table backend) is slower. I pointed this out to the data.table developers Matt and Arun, and this can be fixed. The extra slowdown in creating a new column with dplyr with a data.table source (vs plain data.table) can also be fixed.
There are several other benchmarks, for example Matt's benchmark of group-by, or Brodie Gaslam's benchmark of group-by and mutate. My goal was to look at a wider range of operations (while keeping the work minimal, so I had to concentrate on a few sample cases). I also wanted to understand the reasons behind the performance differences, and in this respect I'd like to thank the developers for the useful pointers.
Besides R, Python is almost as widely used for data analysis nowadays (see how the two dominate the DataScience.LA data science toolbox survey).
It looks like Python's pandas (0.15.1) is slower than data.table for both aggregates and joins (contrary to measurements/claims from almost 3 years ago). For example, for n = 10M and m = 1M, the runtimes (in seconds, lower is better):
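For reference, the pandas equivalents of the aggregate and join look roughly like this (a sketch with smaller sizes for illustration; the variable names `agg` and `joined` are my own, and the post measured n = 10M, m = 1M with pandas 0.15.1):

```python
import numpy as np
import pandas as pd

# smaller sizes for illustration; the post used n = 10M, m = 1M
n, m = 100_000, 1_000
d = pd.DataFrame({"x": np.random.randint(1, m + 1, n), "y": np.random.rand(n)})
dm = pd.DataFrame({"x": np.arange(1, m + 1)})

# aggregate: mean of y by group x
agg = d.groupby("x")["y"].mean()

# join: inner join on x (every x in d appears in dm, so no rows are dropped)
joined = d.merge(dm, on="x")
```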