# Simple/basic/limited/incomplete benchmark for dplyr and data.table

For parameters n = 10M, 100M and m = 100, 10K, 1M, create data.frames

```r
d <- data.frame(x = sample(m, n, replace=TRUE), y = runif(n))
dm <- data.frame(x = sample(m))
```

and corresponding data.tables with and without a key on x (d's size in RAM is around 100MB and 1GB, respectively).
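The setup can be sketched along these lines (a minimal sketch, assuming the data.table package is installed; smaller n and m are used here so it runs quickly, and the names dt_k, dtm_k are illustrative):

```r
library(data.table)

set.seed(1)
n <- 1e5
m <- 1e2

d  <- data.frame(x = sample(m, n, replace = TRUE), y = runif(n))
dm <- data.frame(x = sample(m))

# data.table copies without a key...
dt  <- as.data.table(d)
dtm <- as.data.table(dm)

# ...and with a key on x (setkey pre-sorts the table in place by x)
dt_k  <- copy(dt);  setkey(dt_k, x)
dtm_k <- copy(dtm); setkey(dtm_k, x)
```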

The basic tabular operations (filter, aggregate, join, etc.) are applied using base R, dplyr (with data.frame and data.table backends, with and without a key for data.table) and plain data.table (with and without a key).

This is just a simple/basic/limited/incomplete benchmark; one could do more with various data types (e.g. character), several grouping variables (x1, x2, ...), more values for the size parameters (n, m), different distributions of values in the data.frames, etc. (or with real-world datasets).

#### Filter

```r
d[d$x>=10 & d$x<20,]
d %>% filter(x>=10, x<20)
dt[x>=10 & x<20]
```

#### Sort

```r
d[order(d$x),]
d %>% arrange(x)
dt[order(x)]
```

#### New column

```r
d$y2 <- 2*d$y
d %>% mutate(y2 = 2*y)
dt[,y2 := 2*y]
```

#### Aggregation

```r
tapply(d$y, d$x, mean)
d %>% group_by(x) %>% summarize(ym = mean(y))
dt[, mean(y), by=x]
```

#### Join

```r
merge(d, dm, by="x")
d %>% inner_join(dm, by="x")
dt[dtm, nomatch=0]
```
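On a toy example the three styles can be checked to give matching results (a sketch, assuming dplyr and data.table are installed; note that the dt[dtm, nomatch=0] join form used here relies on dt being keyed on x):

```r
library(dplyr)
library(data.table)

set.seed(42)
d   <- data.frame(x = sample(100, 1000, replace = TRUE), y = runif(1000))
dm  <- data.frame(x = sample(100, 50))
dt  <- as.data.table(d)
dtm <- as.data.table(dm)
setkey(dt, x)   # the dt[dtm, nomatch=0] join form relies on the key

# the three filter styles select the same rows
f_base  <- d[d$x >= 10 & d$x < 20, ]
f_dplyr <- d %>% filter(x >= 10, x < 20)
f_dt    <- dt[x >= 10 & x < 20]
stopifnot(nrow(f_base) == nrow(f_dplyr), nrow(f_base) == nrow(f_dt))

# the three join styles produce the same number of matched rows
j_base  <- merge(d, dm, by = "x")
j_dplyr <- d %>% inner_join(dm, by = "x")
j_dt    <- dt[dtm, nomatch = 0]
stopifnot(nrow(j_base) == nrow(j_dplyr), nrow(j_base) == nrow(j_dt))
```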

### Results

Full code is in bm.Rmd, and results for each n, m are in the bm-nxx-mxx.md files in the repo. The latest CRAN versions of R, dplyr and data.table have been used (R 3.1.1, dplyr 0.3.0.2 and data.table 1.9.4). A summary of results (relative running times, lower is better) is here:

|             | base  | dplyr-df | dplyr-dt | dplyr-dt-k | dt    | dt-k |
|-------------|-------|----------|----------|------------|-------|------|
| Filter      | 2     | 1        | 1        | 1          | 1     | 1    |
| Sort        | 30-60 | 20-30    | 1.5-3    | 1          | 1.5-3 | 1    |
| New column  | 1     | 1        | 6        | 4          | 4     | 1    |
| Aggregation | 8-100 | 4-30     | 4-6      | 1.5        | 1.5-5 | 1    |
| Join        | >100  | 4-15     | 4-6      | 1.5-2.5    | -     | 1    |

(the larger numbers are usually for larger m, i.e. lots of small groups)

Discussion:

- Having a key (which for data.table means having the data pre-sorted in place) obviously helps with sorting, aggregation and joins (though depending on the use case, the time to generate the key should be added to the timing).

- dplyr with the data.table backend/source is almost as fast as plain data.table (because in this case dplyr acts as a wrapper and calls data.table functions behind the scenes) - so you can kind of have both: the dplyr API (my personal preference) and speed.

- dplyr with a data.frame source is slower than data.table for sort, aggregation and joins. Some of this apparently has to do with radix sort and binary-search joins (data.table) being faster than hash-table-based joins (dplyr), as described here, but some of it is likely to be improved, as Hadley said here.

- Defining a new column in data.table (or in dplyr with the data.table backend) is slower. I pointed this out to the data.table developers Matt and Arun, and this can be fixed. The extra slowdown in creating a new column with dplyr on a data.table source (vs plain data.table) can also be fixed.
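The first point - that key-creation time may need to be counted - can be illustrated roughly as follows (a sketch with a smaller n so it runs quickly; the exact numbers will differ from the table above):

```r
library(data.table)

set.seed(1)
n <- 1e6; m <- 1e4
dt <- data.table(x = sample(m, n, replace = TRUE), y = runif(n))

t_nokey <- system.time(dt[, mean(y), by = x])["elapsed"]   # unkeyed aggregation
t_key   <- system.time(setkey(dt, x))["elapsed"]           # pre-sorting in place
t_keyed <- system.time(dt[, mean(y), by = x])["elapsed"]   # keyed aggregation

# the keyed aggregation alone is faster, but a fair one-off comparison
# adds the key-creation cost to it:
c(nokey = t_nokey, keyed = t_keyed, keyed_total = t_key + t_keyed)
```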

More info:

I gave a short 15-min talk at the LA R meetup about dplyr, and I talked about these results as well; slides here.

There are several other benchmarks, for example Matt's benchmark of group-by, or Brodie Gaslam's benchmark of group-by and mutate. My goal was to look at a wider range of operations (while keeping the work minimal, so I had to concentrate on a few cases) - and I also wanted to understand the reasons for the performance differences, and in this respect I'd like to thank the developers for the useful pointers.

Python's pandas:

Besides R, Python is almost as widely used for data analysis nowadays (see how the two dominate in the DataScience.LA data science toolbox survey).

It looks like Python's pandas (0.15.1) is slower than data.table for both aggregation and joins (contrary to measurements/claims from almost 3 years ago). For example, for n = 10M and m = 1M the runtimes (in seconds, lower is better) are:

|                  | pandas | data.table |
|------------------|--------|------------|
| Aggregate        | 1.5    | 1          |
| Aggregate (keys) | 0.4    | 0.2        |
| Join             | 5.9    | -          |
| Join (keys)      | 2.1    | 0.5        |
| Creating keys    | 3.7    | 0.7        |