n_distinct way slower than length(unique) #977

dpeterson71 · 2015-02-17T19:26:36Z

I would like to move to more uniform implementation of dplyr memes; I really like the syntax. However, I am seeing several instances where dplyr analogues to plyr or base-R functions incur a severe performance hit on my data sets.

Here is a simple example ilustrating that dplyr's n_distinct is a factor of two slower than base-R.

library(dplyr)
library(microbenchmark)
y <- rep(1:4096, 100)
microbenchmark(length(unique(y)), n_distinct(y))

The results are:

Unit: milliseconds
              expr      min        lq      mean    median        uq      max neval
 length(unique(y)) 4.336269  6.187241  6.997882  6.378097  6.529753 41.15351   100
     n_distinct(y) 9.587087 10.190942 10.375032 10.404678 10.578767 10.95172   100

The text was updated successfully, but these errors were encountered:

dgrtwo · 2015-08-05T18:30:18Z

A simulation and graph exploring the effect of vector length and uniqueness on the performance of n_distinct and length(unique( is below, also shown here. It looks like the performance difference is greater for long vectors made up of many uniques.

library(dplyr)
library(microbenchmark)

set.seed(2015-08-05)

time_distinct <- function(uniques, len, class) {
  x <- sample(uniques, len, replace = TRUE)
  x <- as(x, class)
  m <- microbenchmark(n_distinct(x),
                      length(unique(x)), 
                      times = 1000L)
  ret <- summary(m)
  ret$uniques <- length(unique(x))
  ret
}

timings <- expand.grid(uniques = round(10 ^ seq(0, 5, 1)),
                       len = round(10 ^ seq(1, 4, 1)),
                       class = c("numeric", "character"),
                       stringsAsFactors = FALSE) %>%
  group_by(uniques, len, class) %>%
  do(do.call(time_distinct, .))

library(ggplot2)

ggplot(timings, aes(uniques, median, color = expr, lty = class)) +
  geom_line() +
  scale_x_log10() +
  geom_errorbar(aes(ymin = lq, ymax = uq), width = .1) +
  facet_wrap(~ len, scales = "free") +
  xlab("Number of unique elements in vector") +
  ylab("Median time")

krlmlr · 2016-09-25T10:46:50Z

Confirmed. This will need some profiling, it's not immediately obvious where the extra time is spent. Some notes:

Hash table size should be set to the expected size for each chunk
Currently, SlicingIndex instantiates an IntegerVector for natural sequences, this could be optimized away, either with an interface and two implementations, or with templates

krlmlr · 2016-09-26T07:42:28Z

Can you please try devtools::install_github("krlmlr/dplyr@f-977-hash-table-size")? It's running roughly 30% faster now on my machine.

krlmlr · 2016-09-29T10:16:11Z

Further improvements should be done in the context of r-lib/vctrs#8.

krlmlr · 2018-03-25T13:28:29Z

I wonder if we can do better now with NaturalSlicingIndex.

krlmlr · 2018-03-25T13:46:15Z

But there's also quadratic run time that occurs only when strings are involved, thanks @randomgambit:

library(dplyr)
library(microbenchmark)

test <- function(N, M, transform = identity) {
  x <- rep(seq_len(N), M)
  y <- sample(transform(rep(seq_len(N), M)))
  tbl <- tibble(x, y) %>% group_by(x)
  
  tbl %>% mutate(n = n_distinct(y))
}

system.time(test(500, 1000))
#>    user  system elapsed 
#>   0.104   0.004   0.109
system.time(test(1000, 1000))
#>    user  system elapsed 
#>   0.304   0.012   0.316
system.time(test(1500, 1000))
#>    user  system elapsed 
#>   0.412   0.004   0.417
system.time(test(2000, 1000))
#>    user  system elapsed 
#>   0.639   0.004   0.644
system.time(test(2500, 1000))
#>    user  system elapsed 
#>   0.665   0.024   0.690
system.time(test(3000, 1000))
#>    user  system elapsed 
#>   0.936   0.040   0.976


system.time(test(500, 100, as.character))
#>    user  system elapsed 
#>   0.082   0.000   0.082
system.time(test(1000, 100, as.character))
#>    user  system elapsed 
#>    0.26    0.00    0.26
system.time(test(1500, 100, as.character))
#>    user  system elapsed 
#>   0.586   0.000   0.586
system.time(test(2000, 100, as.character))
#>    user  system elapsed 
#>   0.998   0.000   0.998
system.time(test(2500, 100, as.character))
#>    user  system elapsed 
#>   1.559   0.000   1.559
system.time(test(3000, 100, as.character))
#>    user  system elapsed 
#>   2.416   0.000   2.416

Created on 2018-03-25 by the reprex package (v0.2.0).

romainfrancois · 2018-03-27T05:51:30Z

Is that the same thing as this perhaps: r-lib/tidyselect#56

romainfrancois · 2018-03-27T06:04:37Z

There are two things here:

String is too careful about the protection of the CHARSXP. There's probably room for a String that does not do any of that protection.
n_distinct is implemented with a hash set. We can probably revisit that and use a sorting based algorithm, perhaps even borrow the radix sorting from data.table. The other bonus is that it is much easier to run a sort in parallel.

This might be a vctrs thing. Perhaps I can work on that in a separate package for the time being.

krlmlr · 2018-03-29T22:48:15Z

We need to use a profiler to identify the bottleneck here. Unfortunately, gprofiler doesn't work (yet) on the Mac, but I can take a look later.

romainfrancois · 2018-04-09T10:10:18Z

unique uses its own hash table implementation. it might be worth extracting it out from the R source into a package at some point. unique actually initially create a logical vector (the one we would get from duplicate), then extract the unique values afterwards. Perhaps we could skip that last step. decyphering what goes on in unique.c might be interesting in the context of vctrs.

but for the mean time, perhaps we can isolate the special case when there is only one variable and use length(unique(.)) in that case, something like that:

n_distinct <- function(..., na.rm = FALSE) {
  dots <- list(...)
  if(length(dots)==1L) length(unique(dots[[1L]])) else n_distinct_multi(list(...), na.rm)
}

krlmlr · 2018-04-09T23:45:35Z

I'd like to see a profile before starting any action.

romainfrancois · 2018-05-29T14:14:53Z

At least, n_distinct still performs better when there are more than one variables:

> library(dplyr)
> 
> y <- rep(1:4096, 100)
> x <- sample(y)
> d <- data.frame(x = x, y=y)
> 
> microbenchmark::microbenchmark(
+   nrow(unique(d)), 
+   n_distinct(d$x, d$y), 
+   times = 10
+ )
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
      nrow(unique(d)) 727.5183 736.8192 835.4494 876.3191 894.7565 912.2624    10
 n_distinct(d$x, d$y) 151.9319 159.0957 176.5848 177.9946 192.0380 201.7723    10

unsurprinsingly, because duplicated.data.frame looks like this:

> duplicated.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...) 
{
    if (!isFALSE(incomparables)) 
        .NotYetUsed("incomparables != FALSE")
    if (length(x) != 1L) 
        duplicated(do.call(Map, c(list, x)), fromLast = fromLast)
    else duplicated(x[[1L]], fromLast = fromLast, ...)
}
<bytecode: 0x10d9c0f28>

so it does a special case for the case when there is only one variable.

krlmlr · 2018-05-30T12:58:28Z

I'll make this a use case for jointprof.

romainfrancois · 2018-10-17T10:10:12Z

@krlmlr any luck with the profiler ?

krlmlr · 2018-11-07T10:35:11Z

Not yet, the timing is still consistent with the OP.

Would you mind trying jointprof? The README should get you started: https://r-prof.github.io/jointprof/. (It has worked for me on OS X, but I haven't tried in a while.)

romainfrancois · 2018-11-07T10:37:01Z

I think I'll get on this once we can use the hashing from vctrs

LVG77 · 2019-03-01T16:15:05Z

I am not sure whether this could be helpful for solving this issue but here is something that I came across yesterday while trying to figure out why my dplyr pipe is so slow. It seems like it is a n_distinct() issue but the interesting part is that n_distinct() performs much faster when used with data.table. In this specific case we are looking at ~60x speed increase.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(data.table))
library(microbenchmark)

# create dummy data
df <- tibble(fact = rep(letters, each = 12 * 1e3), 
             month = rep(month.name, 26 * 1e3), 
             num = rep(sample(10000:100000, 312*10), 1e2), 
             prob = runif(312 * 1e3))


# run benchmark
microbenchmark(
  dplyr = df %>% 
    group_by(fact, num) %>% 
    summarise(count = n_distinct(month))
  ,
  data.table = as.data.table(df)[, .(count = n_distinct(month)), by = .(fact, num)]
  , times = 1L)
#> Unit: milliseconds
#>        expr        min         lq       mean     median         uq
#>       dplyr 64278.0464 64278.0464 64278.0464 64278.0464 64278.0464
#>  data.table   377.4892   377.4892   377.4892   377.4892   377.4892
#>         max neval
#>  64278.0464     1
#>    377.4892     1

^{Created on 2019-03-01 by the reprex package (v0.2.1)}

Please let me know if I can be of any help.
Thank you for all the great work you are doing!

foeffa · 2019-03-31T17:58:00Z

Very similar observation here, comparable to @MichaelAdolph. Confidential client data, so can't share more but key points are as follows:

7M row dataset
Have to group on 3 variables, should end up with ~700K rows,
Unique values in n_distinct vector quite limited (4-5)

Runs horribly slow. I.e. "I can't use this in production"-type slow.

lionel- · 2019-05-27T16:41:33Z

I can't reproduce this extreme slowness with the CRAN binaries, however it seems that dplyr is extremely sensitive to the optimisation flags it is compiled with.

library(dplyr)
library(vctrs)

y <- rep(1:4096, 100)

bench::mark(
  base = length(unique(y)),
  dplyr = n_distinct(y),
  vctrs = vec_unique_count(y)
)[1:6]


## Optimised binaries

#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base         4.11ms   4.69ms      194.    5.58MB     24.5
#> 2 dplyr        7.61ms   8.09ms      123.    2.39KB      0
#> 3 vctrs         4.4ms   4.85ms      196.    5.57MB     22.9


## No optimisation (except for base)

#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base         4.12ms   4.62ms     212.     5.58MB     39.1
#> 2 dplyr       53.58ms  54.95ms      18.1        0B      0
#> 3 vctrs       13.13ms  13.51ms      73.1    5.57MB     11.8

The debug builds are especially slow with strings:

library(dplyr)
library(vctrs)

res <- bench::press(
  N = c(2000, 4000, 6000, 8000, 10000),
  fn = c("identity", "as.character"),
  {
    M <- 100
    x <- rep(seq_len(N), M)

    # Can't press over a list?
    fn <- get(fn)
    y <- sample(fn(x))

    bench::mark(
      base = length(unique(y)),
      dplyr = n_distinct(y),
      vctrs = vec_unique_count(y)
    )
  }
)

library(ggplot2)

res %>%
  ggplot(aes(
    N,
    median,
    colour = as.character(expression)
  )) +
  geom_point() +
  geom_line() +
  facet_wrap(~fn)

Release build

Debug build

IIUC, dplyr uses an STL hashmap? That could explain the slowness, as heavily templated C++ code is known to have slow debug builds, they need compiler inlining for speed.

hadley · 2019-05-27T17:37:23Z

Thanks @lionel-! We'll switch the implementation to use vec_unique_count() in 0.9.0. I'll open an issue to review the vectorised functions and implement using vctrs.

Demetrio92 · 2019-05-28T12:53:22Z

But there's also quadratic run time that occurs only when strings are involved

uff, for real? This thing just killed my R numerous times. Maybe as a workaround warn the users?

lionel- · 2019-05-28T13:01:14Z

uff, for real? This thing just killed my R numerous times. Maybe as a workaround warn the users?

Can you post a reprex with a problematic case? I couldn't find anything wrong as long as dplyr is properly compiled with optimisations. I may have missed something.

dkincaid · 2019-05-28T14:00:28Z

I can provide an extreme example that I was fighting last week. In my case I'm working with a data frame of about 4.6 million rows. But I can provide an example with a smaller set of records.

I group by one column and then am summarizing the distinct counts of two of other columns. Using n_distinct it ran for about 2 hours. After I googled a bit and found this issue I changed it to use length(unique()) and it now runs in about 1 second.

The example file I'm providing includes only 500,000 records. I'm attaching the data file to this comment.
small_problems.zip

As mentioned the full dataset of 4.6 million records does take at least 2 hours for the dplyr version to run.

library(microbenchmark)
library(dplyr)

load("/tmp/small_problems.Rdata")

microbenchmark(
  namesSummary_base = small_problems %>%
    group_by(tolower(problem_name)) %>%
    summarise(n = n(),
              numPractices = length(unique(rowkey)),
              numAnimals = length(unique(rowkey))
    ),

  namesSummary_dplyr = small_problems %>%
    group_by(tolower(problem_name)) %>%
    summarise(n = n(),
              numLocations = n_distinct(location_id),
              numAnimals = n_distinct(rowkey)),
  times = 1L)
#> Unit: milliseconds
#>                expr        min         lq       mean     median         uq          max neval
#>   namesSummary_base   434.6912   434.6912   434.6912   434.6912   434.6912     434.6912     1
#>   namesSummary_dplyr 27125.2335 27125.2335 27125.2335 27125.2335 27125.2335  27125.2335     1

^{Created on 2019-05-28 by the reprex package (v0.3.0)}

My dplyr version is 0.8.0.1 and I'm running on Ubuntu Linux 19.04.

I don't know how users are supposed to know if the dplyr we are using was "properly compiled with optimisations", so if there are special instructions for installing the dplyr package it needs to be documented.

dkincaid · 2019-05-28T14:11:46Z

Interestingly, I just thought I'd see if there was a newer version of dplyr out there and updated to 0.8.1 off of CRAN. I re-ran the test and now the dplyr version is just as fast as the base R version.

Here are the results using dplyr 0.8.1 of the above example I provided.

Unit: milliseconds
               expr      min       lq     mean   median       uq      max neval
  namesSummary_base 353.2978 353.2978 353.2978 353.2978 353.2978 353.2978     1
 namesSummary_dplyr 345.4369 345.4369 345.4369 345.4369 345.4369 345.4369     1

lionel- · 2019-05-28T14:30:22Z

@dkincaid Thanks, confirmed that with 0.8.0, a single run of your example takes 3 minutes with a debug build and 39s with a release build. On 0.8.1, the slowness is gone. Any idea what might have fixed it @romainfrancois? Perhaps something to do with groups or hybrid eval?

0.8.1

## Release build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base     545ms    545ms      1.83   36.88MB     5.50
#> 2 namesSummary_dplyr    424ms    444ms      2.25    8.48MB     1.13

## Debug build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base  881.94ms 881.94ms     1.13    38.88MB     1.13
#> 2 namesSummary_dplyr    1.32s    1.32s     0.757    9.04MB     0

0.8.0.1

## Release build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base     699ms    699ms    1.43     93.49MB     7.15
#> 2 namesSummary_dplyr      39s      39s    0.0256    8.58MB     0

## Debug build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base     1.01s    1.01s   0.990     93.49MB     2.97
#> 2 namesSummary_dplyr    3.26m    3.26m   0.00511    8.58MB     0

I don't know of any documentation for configuring compilation, except for https://cran.r-project.org/doc/manuals/r-release/R-admin.pdf. My guess is that your distribution creates a correctly configured Makevars file? See if adding this to ~/.R/Makevars helps:

CFLAGS +=  -O3
CXXFLAGS += -O3
CXX1XFLAGS += -O3

When you install a package, check the output for -On flags. The last one prevails, -O0 means no optimisation.

romainfrancois · 2019-05-29T04:21:10Z

I don't recall right now. Might be an accidental fix 😆

lock · 2019-11-25T05:16:58Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

romainfrancois self-assigned this Feb 23, 2015

romainfrancois added the feature a feature request or enhancement label Apr 22, 2015

hadley added this to the 0.5 milestone May 19, 2015

dgrtwo mentioned this issue Aug 5, 2015

n_distinct(x) slower than length(unique(x)) #1304

Closed

hadley changed the title ~~n_distinct way slower than length(unique) in version 0.4.1~~ n_distinct way slower than length(unique) Oct 22, 2015

romainfrancois mentioned this issue Nov 9, 2015

dplyr:distinct is not more performant than base:unique #1519

Closed

hadley added the data frame label Mar 1, 2016

hadley modified the milestones: future, 0.5 Mar 8, 2016

hadley mentioned this issue Sep 6, 2016

Efficient n_distinct r-lib/vctrs#8

Closed

krlmlr mentioned this issue Sep 26, 2016

Better default values for size of some hash tables krlmlr/dplyr#17

Closed

krlmlr closed this as completed in b0fb5b0 Sep 29, 2016

krlmlr added the performance 🚀 label Mar 25, 2018

krlmlr reopened this Mar 25, 2018

krlmlr mentioned this issue May 30, 2018

Use case: Slow n_distinct() r-prof/jointprof#24

Open

romainfrancois mentioned this issue May 31, 2018

Support data frame and matrix columns #3630

Closed

romainfrancois added the vctrs ↗️ label Nov 7, 2018

This comment has been minimized.

Sign in to view

hideaki mentioned this issue Apr 9, 2019

Wrapper to temporarily workaround n_distinct performance issue exploratory-io/exploratory_func#556

Merged

5 tasks

hadley modified the milestones: future, 0.9.0 May 27, 2019

hadley closed this as completed May 27, 2019

hadley mentioned this issue May 27, 2019

Use vctrs implementations for vectorised functions #4389

Closed

lock bot locked and limited conversation to collaborators Nov 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

n_distinct way slower than length(unique) #977

n_distinct way slower than length(unique) #977

dpeterson71 commented Feb 17, 2015

dgrtwo commented Aug 5, 2015

krlmlr commented Sep 25, 2016

krlmlr commented Sep 26, 2016

krlmlr commented Sep 29, 2016

krlmlr commented Mar 25, 2018 •

edited

Loading

krlmlr commented Mar 25, 2018

romainfrancois commented Mar 27, 2018

romainfrancois commented Mar 27, 2018

krlmlr commented Mar 29, 2018

romainfrancois commented Apr 9, 2018

krlmlr commented Apr 9, 2018

romainfrancois commented May 29, 2018

krlmlr commented May 30, 2018

romainfrancois commented Oct 17, 2018

krlmlr commented Nov 7, 2018

romainfrancois commented Nov 7, 2018

LVG77 commented Mar 1, 2019

This comment has been minimized.

foeffa commented Mar 31, 2019

lionel- commented May 27, 2019 •

edited

Loading

hadley commented May 27, 2019

Demetrio92 commented May 28, 2019

lionel- commented May 28, 2019

dkincaid commented May 28, 2019

dkincaid commented May 28, 2019 •

edited

Loading

lionel- commented May 28, 2019

romainfrancois commented May 29, 2019

lock bot commented Nov 25, 2019

n_distinct way slower than length(unique) #977

n_distinct way slower than length(unique) #977

Comments

dpeterson71 commented Feb 17, 2015

dgrtwo commented Aug 5, 2015

krlmlr commented Sep 25, 2016

krlmlr commented Sep 26, 2016

krlmlr commented Sep 29, 2016

krlmlr commented Mar 25, 2018 • edited Loading

krlmlr commented Mar 25, 2018

romainfrancois commented Mar 27, 2018

romainfrancois commented Mar 27, 2018

krlmlr commented Mar 29, 2018

romainfrancois commented Apr 9, 2018

krlmlr commented Apr 9, 2018

romainfrancois commented May 29, 2018

krlmlr commented May 30, 2018

romainfrancois commented Oct 17, 2018

krlmlr commented Nov 7, 2018

romainfrancois commented Nov 7, 2018

LVG77 commented Mar 1, 2019

This comment has been minimized.

foeffa commented Mar 31, 2019

lionel- commented May 27, 2019 • edited Loading

Release build

Debug build

hadley commented May 27, 2019

Demetrio92 commented May 28, 2019

lionel- commented May 28, 2019

dkincaid commented May 28, 2019

dkincaid commented May 28, 2019 • edited Loading

lionel- commented May 28, 2019

0.8.1

0.8.0.1

romainfrancois commented May 29, 2019

lock bot commented Nov 25, 2019

krlmlr commented Mar 25, 2018 •

edited

Loading

lionel- commented May 27, 2019 •

edited

Loading

dkincaid commented May 28, 2019 •

edited

Loading