n_distinct way slower than length(unique) #977

@dpeterson71

I would like to move to a more uniform implementation of dplyr memes; I really like the syntax. However, I am seeing several instances where dplyr analogues of plyr or base-R functions incur a severe performance hit on my data sets.

Here is a simple example illustrating that dplyr's n_distinct is about 1.6 times slower than base-R's length(unique()) at the median, and roughly a factor of two slower at the minimum.

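library(dplyr)
library(microbenchmark)

# 409,600 integers containing 4,096 distinct values
y <- rep(1:4096, 100)
# Time both ways of counting distinct values (microbenchmark defaults to 100 runs each)
microbenchmark(length(unique(y)), n_distinct(y))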

The results are:

Unit: milliseconds
              expr      min        lq      mean    median        uq      max neval
 length(unique(y)) 4.336269  6.187241  6.997882  6.378097  6.529753 41.15351   100
     n_distinct(y) 9.587087 10.190942 10.375032 10.404678 10.578767 10.95172   100
