Closed
I would like to move to a more uniform use of dplyr idioms; I really like the syntax. However, I am seeing several cases where dplyr analogues of plyr or base-R functions incur a severe performance hit on my data sets.
Here is a simple example illustrating that dplyr's n_distinct is about 1.6 times slower than the base-R equivalent at the median, and roughly a factor of two slower at the minimum.
library(dplyr)
library(microbenchmark)
y <- rep(1:4096, 100)
microbenchmark(length(unique(y)), n_distinct(y))
The results are:
Unit: milliseconds
              expr      min        lq      mean    median        uq      max neval
 length(unique(y)) 4.336269  6.187241  6.997882  6.378097  6.529753 41.15351   100
     n_distinct(y) 9.587087 10.190942 10.375032 10.404678 10.578767 10.95172   100
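To put a number on the slowdown, here is a minimal base-R sketch that computes the ratio of the two rows for a few of the reported statistics. The timings are copied from the output above; the names `reported` and `slowdown` are just illustrative and not part of dplyr or microbenchmark.

```r
# Timings copied from the benchmark output above (milliseconds).
reported <- data.frame(
  expr   = c("length(unique(y))", "n_distinct(y)"),
  min    = c(4.336269, 9.587087),
  mean   = c(6.997882, 10.375032),
  median = c(6.378097, 10.404678),
  max    = c(41.15351, 10.95172)
)

# Ratio of the n_distinct() row (2) to the length(unique()) row (1)
# for each statistic; values above 1 mean n_distinct() was slower.
slowdown <- sapply(
  reported[, c("min", "mean", "median", "max")],
  function(col) col[2] / col[1]
)

print(round(slowdown, 2))
```

The ratio depends on which statistic is compared. The max column goes the other way because of a single 41 ms outlier in the base-R run, so the median is probably the fairer comparison.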