Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

n_distinct way slower than length(unique) #977

Closed
dpeterson71 opened this issue Feb 17, 2015 · 28 comments
Closed

n_distinct way slower than length(unique) #977

dpeterson71 opened this issue Feb 17, 2015 · 28 comments

Comments

@dpeterson71
Copy link

@dpeterson71 dpeterson71 commented Feb 17, 2015

I would like to move to more uniform implementation of dplyr memes; I really like the syntax. However, I am seeing several instances where dplyr analogues to plyr or base-R functions incur a severe performance hit on my data sets.

Here is a simple example ilustrating that dplyr's n_distinct is a factor of two slower than base-R.

library(dplyr)
library(microbenchmark)
y <- rep(1:4096, 100)
microbenchmark(length(unique(y)), n_distinct(y))

The results are:

Unit: milliseconds
              expr      min        lq      mean    median        uq      max neval
 length(unique(y)) 4.336269  6.187241  6.997882  6.378097  6.529753 41.15351   100
     n_distinct(y) 9.587087 10.190942 10.375032 10.404678 10.578767 10.95172   100
@romainfrancois romainfrancois self-assigned this Feb 23, 2015
@hadley hadley added this to the 0.5 milestone May 19, 2015
@dgrtwo
Copy link
Member

@dgrtwo dgrtwo commented Aug 5, 2015

A simulation and graph exploring the effect of vector length and uniqueness on the performance of n_distinct and length(unique( is below, also shown here. It looks like the performance difference is greater for long vectors made up of many uniques.

library(dplyr)
library(microbenchmark)

set.seed(2015-08-05)

time_distinct <- function(uniques, len, class) {
  x <- sample(uniques, len, replace = TRUE)
  x <- as(x, class)
  m <- microbenchmark(n_distinct(x),
                      length(unique(x)), 
                      times = 1000L)
  ret <- summary(m)
  ret$uniques <- length(unique(x))
  ret
}

timings <- expand.grid(uniques = round(10 ^ seq(0, 5, 1)),
                       len = round(10 ^ seq(1, 4, 1)),
                       class = c("numeric", "character"),
                       stringsAsFactors = FALSE) %>%
  group_by(uniques, len, class) %>%
  do(do.call(time_distinct, .))

library(ggplot2)

ggplot(timings, aes(uniques, median, color = expr, lty = class)) +
  geom_line() +
  scale_x_log10() +
  geom_errorbar(aes(ymin = lq, ymax = uq), width = .1) +
  facet_wrap(~ len, scales = "free") +
  xlab("Number of unique elements in vector") +
  ylab("Median time")

image

Loading

@hadley hadley changed the title n_distinct way slower than length(unique) in version 0.4.1 n_distinct way slower than length(unique) Oct 22, 2015
@hadley hadley added this to the future milestone Mar 8, 2016
@hadley hadley removed this from the 0.5 milestone Mar 8, 2016
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Sep 25, 2016

Confirmed. This will need some profiling, it's not immediately obvious where the extra time is spent. Some notes:

  • Hash table size should be set to the expected size for each chunk
  • Currently, SlicingIndex instantiates an IntegerVector for natural sequences, this could be optimized away, either with an interface and two implementations, or with templates

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Sep 26, 2016

Can you please try devtools::install_github("krlmlr/dplyr@f-977-hash-table-size")? It's running roughly 30% faster now on my machine.

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Sep 29, 2016

Further improvements should be done in the context of r-lib/vctrs#8.

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 25, 2018

I wonder if we can do better now with NaturalSlicingIndex.

Loading

@krlmlr krlmlr reopened this Mar 25, 2018
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 25, 2018

But there's also quadratic run time that occurs only when strings are involved, thanks @randomgambit:

library(dplyr)
library(microbenchmark)

test <- function(N, M, transform = identity) {
  x <- rep(seq_len(N), M)
  y <- sample(transform(rep(seq_len(N), M)))
  tbl <- tibble(x, y) %>% group_by(x)
  
  tbl %>% mutate(n = n_distinct(y))
}

system.time(test(500, 1000))
#>    user  system elapsed 
#>   0.104   0.004   0.109
system.time(test(1000, 1000))
#>    user  system elapsed 
#>   0.304   0.012   0.316
system.time(test(1500, 1000))
#>    user  system elapsed 
#>   0.412   0.004   0.417
system.time(test(2000, 1000))
#>    user  system elapsed 
#>   0.639   0.004   0.644
system.time(test(2500, 1000))
#>    user  system elapsed 
#>   0.665   0.024   0.690
system.time(test(3000, 1000))
#>    user  system elapsed 
#>   0.936   0.040   0.976


system.time(test(500, 100, as.character))
#>    user  system elapsed 
#>   0.082   0.000   0.082
system.time(test(1000, 100, as.character))
#>    user  system elapsed 
#>    0.26    0.00    0.26
system.time(test(1500, 100, as.character))
#>    user  system elapsed 
#>   0.586   0.000   0.586
system.time(test(2000, 100, as.character))
#>    user  system elapsed 
#>   0.998   0.000   0.998
system.time(test(2500, 100, as.character))
#>    user  system elapsed 
#>   1.559   0.000   1.559
system.time(test(3000, 100, as.character))
#>    user  system elapsed 
#>   2.416   0.000   2.416

Created on 2018-03-25 by the reprex package (v0.2.0).

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 27, 2018

Is that the same thing as this perhaps: r-lib/tidyselect#56

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 27, 2018

There are two things here:

  • String is too careful about the protection of the CHARSXP. There's probably room for a String that does not do any of that protection.

  • n_distinct is implemented with a hash set. We can probably revisit that and use a sorting based algorithm, perhaps even borrow the radix sorting from data.table. The other bonus is that it is much easier to run a sort in parallel.

This might be a vctrs thing. Perhaps I can work on that in a separate package for the time being.

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 29, 2018

We need to use a profiler to identify the bottleneck here. Unfortunately, gprofiler doesn't work (yet) on the Mac, but I can take a look later.

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Apr 9, 2018

unique uses its own hash table implementation. it might be worth extracting it out from the R source into a package at some point. unique actually initially create a logical vector (the one we would get from duplicate), then extract the unique values afterwards. Perhaps we could skip that last step. decyphering what goes on in unique.c might be interesting in the context of vctrs.

but for the mean time, perhaps we can isolate the special case when there is only one variable and use length(unique(.)) in that case, something like that:

n_distinct <- function(..., na.rm = FALSE) {
  dots <- list(...)
  if(length(dots)==1L) length(unique(dots[[1L]])) else n_distinct_multi(list(...), na.rm)
}

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Apr 9, 2018

I'd like to see a profile before starting any action.

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented May 29, 2018

At least, n_distinct still performs better when there are more than one variables:

> library(dplyr)
> 
> y <- rep(1:4096, 100)
> x <- sample(y)
> d <- data.frame(x = x, y=y)
> 
> microbenchmark::microbenchmark(
+   nrow(unique(d)), 
+   n_distinct(d$x, d$y), 
+   times = 10
+ )
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
      nrow(unique(d)) 727.5183 736.8192 835.4494 876.3191 894.7565 912.2624    10
 n_distinct(d$x, d$y) 151.9319 159.0957 176.5848 177.9946 192.0380 201.7723    10

unsurprinsingly, because duplicated.data.frame looks like this:

> duplicated.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...) 
{
    if (!isFALSE(incomparables)) 
        .NotYetUsed("incomparables != FALSE")
    if (length(x) != 1L) 
        duplicated(do.call(Map, c(list, x)), fromLast = fromLast)
    else duplicated(x[[1L]], fromLast = fromLast, ...)
}
<bytecode: 0x10d9c0f28>

so it does a special case for the case when there is only one variable.

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented May 30, 2018

I'll make this a use case for jointprof.

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Oct 17, 2018

@krlmlr any luck with the profiler ?

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Nov 7, 2018

Not yet, the timing is still consistent with the OP.

Would you mind trying jointprof? The README should get you started: https://r-prof.github.io/jointprof/. (It has worked for me on OS X, but I haven't tried in a while.)

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Nov 7, 2018

I think I'll get on this once we can use the hashing from vctrs

Loading

@LVG77
Copy link

@LVG77 LVG77 commented Mar 1, 2019

I am not sure whether this could be helpful for solving this issue but here is something that I came across yesterday while trying to figure out why my dplyr pipe is so slow. It seems like it is a n_distinct() issue but the interesting part is that n_distinct() performs much faster when used with data.table. In this specific case we are looking at ~60x speed increase.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(data.table))
library(microbenchmark)

# create dummy data
df <- tibble(fact = rep(letters, each = 12 * 1e3), 
             month = rep(month.name, 26 * 1e3), 
             num = rep(sample(10000:100000, 312*10), 1e2), 
             prob = runif(312 * 1e3))


# run benchmark
microbenchmark(
  dplyr = df %>% 
    group_by(fact, num) %>% 
    summarise(count = n_distinct(month))
  ,
  data.table = as.data.table(df)[, .(count = n_distinct(month)), by = .(fact, num)]
  , times = 1L)
#> Unit: milliseconds
#>        expr        min         lq       mean     median         uq
#>       dplyr 64278.0464 64278.0464 64278.0464 64278.0464 64278.0464
#>  data.table   377.4892   377.4892   377.4892   377.4892   377.4892
#>         max neval
#>  64278.0464     1
#>    377.4892     1

Created on 2019-03-01 by the reprex package (v0.2.1)

Please let me know if I can be of any help.
Thank you for all the great work you are doing!

Loading

@MichaelAdolph

This comment has been hidden.

@foeffa
Copy link

@foeffa foeffa commented Mar 31, 2019

Very similar observation here, comparable to @MichaelAdolph. Confidential client data, so can't share more but key points are as follows:

  • 7M row dataset

  • Have to group on 3 variables, should end up with ~700K rows,

  • Unique values in n_distinct vector quite limited (4-5)

Runs horribly slow. I.e. "I can't use this in production"-type slow.

Loading

@lionel-
Copy link
Member

@lionel- lionel- commented May 27, 2019

I can't reproduce this extreme slowness with the CRAN binaries, however it seems that dplyr is extremely sensitive to the optimisation flags it is compiled with.

library(dplyr)
library(vctrs)

y <- rep(1:4096, 100)

bench::mark(
  base = length(unique(y)),
  dplyr = n_distinct(y),
  vctrs = vec_unique_count(y)
)[1:6]


## Optimised binaries

#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base         4.11ms   4.69ms      194.    5.58MB     24.5
#> 2 dplyr        7.61ms   8.09ms      123.    2.39KB      0
#> 3 vctrs         4.4ms   4.85ms      196.    5.57MB     22.9


## No optimisation (except for base)

#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base         4.12ms   4.62ms     212.     5.58MB     39.1
#> 2 dplyr       53.58ms  54.95ms      18.1        0B      0
#> 3 vctrs       13.13ms  13.51ms      73.1    5.57MB     11.8

The debug builds are especially slow with strings:

library(dplyr)
library(vctrs)

res <- bench::press(
  N = c(2000, 4000, 6000, 8000, 10000),
  fn = c("identity", "as.character"),
  {
    M <- 100
    x <- rep(seq_len(N), M)

    # Can't press over a list?
    fn <- get(fn)
    y <- sample(fn(x))

    bench::mark(
      base = length(unique(y)),
      dplyr = n_distinct(y),
      vctrs = vec_unique_count(y)
    )
  }
)

library(ggplot2)

res %>%
  ggplot(aes(
    N,
    median,
    colour = as.character(expression)
  )) +
  geom_point() +
  geom_line() +
  facet_wrap(~fn)
Release build

ndistinct - release build

Debug build

ndistinct - debug build

IIUC, dplyr uses an STL hashmap? That could explain the slowness, as heavily templated C++ code is known to have slow debug builds, they need compiler inlining for speed.

Loading

@hadley
Copy link
Member

@hadley hadley commented May 27, 2019

Thanks @lionel-! We'll switch the implementation to use vec_unique_count() in 0.9.0. I'll open an issue to review the vectorised functions and implement using vctrs.

Loading

@Demetrio92
Copy link

@Demetrio92 Demetrio92 commented May 28, 2019

But there's also quadratic run time that occurs only when strings are involved

uff, for real? This thing just killed my R numerous times. Maybe as a workaround warn the users?

Loading

@lionel-
Copy link
Member

@lionel- lionel- commented May 28, 2019

uff, for real? This thing just killed my R numerous times. Maybe as a workaround warn the users?

Can you post a reprex with a problematic case? I couldn't find anything wrong as long as dplyr is properly compiled with optimisations. I may have missed something.

Loading

@dkincaid
Copy link

@dkincaid dkincaid commented May 28, 2019

I can provide an extreme example that I was fighting last week. In my case I'm working with a data frame of about 4.6 million rows. But I can provide an example with a smaller set of records.

I group by one column and then am summarizing the distinct counts of two of other columns. Using n_distinct it ran for about 2 hours. After I googled a bit and found this issue I changed it to use length(unique()) and it now runs in about 1 second.

The example file I'm providing includes only 500,000 records. I'm attaching the data file to this comment.
small_problems.zip

As mentioned the full dataset of 4.6 million records does take at least 2 hours for the dplyr version to run.

library(microbenchmark)
library(dplyr)

load("/tmp/small_problems.Rdata")

microbenchmark(
  namesSummary_base = small_problems %>%
    group_by(tolower(problem_name)) %>%
    summarise(n = n(),
              numPractices = length(unique(rowkey)),
              numAnimals = length(unique(rowkey))
    ),

  namesSummary_dplyr = small_problems %>%
    group_by(tolower(problem_name)) %>%
    summarise(n = n(),
              numLocations = n_distinct(location_id),
              numAnimals = n_distinct(rowkey)),
  times = 1L)
#> Unit: milliseconds
#>                expr        min         lq       mean     median         uq          max neval
#>   namesSummary_base   434.6912   434.6912   434.6912   434.6912   434.6912     434.6912     1
#>   namesSummary_dplyr 27125.2335 27125.2335 27125.2335 27125.2335 27125.2335  27125.2335     1

Created on 2019-05-28 by the reprex package (v0.3.0)

My dplyr version is 0.8.0.1 and I'm running on Ubuntu Linux 19.04.

I don't know how users are supposed to know if the dplyr we are using was "properly compiled with optimisations", so if there are special instructions for installing the dplyr package it needs to be documented.

Loading

@dkincaid
Copy link

@dkincaid dkincaid commented May 28, 2019

Interestingly, I just thought I'd see if there was a newer version of dplyr out there and updated to 0.8.1 off of CRAN. I re-ran the test and now the dplyr version is just as fast as the base R version.

Here are the results using dplyr 0.8.1 of the above example I provided.

Unit: milliseconds
               expr      min       lq     mean   median       uq      max neval
  namesSummary_base 353.2978 353.2978 353.2978 353.2978 353.2978 353.2978     1
 namesSummary_dplyr 345.4369 345.4369 345.4369 345.4369 345.4369 345.4369     1

Loading

@lionel-
Copy link
Member

@lionel- lionel- commented May 28, 2019

@dkincaid Thanks, confirmed that with 0.8.0, a single run of your example takes 3 minutes with a debug build and 39s with a release build. On 0.8.1, the slowness is gone. Any idea what might have fixed it @romainfrancois? Perhaps something to do with groups or hybrid eval?

0.8.1

## Release build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base     545ms    545ms      1.83   36.88MB     5.50
#> 2 namesSummary_dplyr    424ms    444ms      2.25    8.48MB     1.13

## Debug build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base  881.94ms 881.94ms     1.13    38.88MB     1.13
#> 2 namesSummary_dplyr    1.32s    1.32s     0.757    9.04MB     0
0.8.0.1
## Release build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base     699ms    699ms    1.43     93.49MB     7.15
#> 2 namesSummary_dplyr      39s      39s    0.0256    8.58MB     0

## Debug build

#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 namesSummary_base     1.01s    1.01s   0.990     93.49MB     2.97
#> 2 namesSummary_dplyr    3.26m    3.26m   0.00511    8.58MB     0

I don't know of any documentation for configuring compilation, except for https://cran.r-project.org/doc/manuals/r-release/R-admin.pdf. My guess is that your distribution creates a correctly configured Makevars file? See if adding this to ~/.R/Makevars helps:

CFLAGS +=  -O3
CXXFLAGS += -O3
CXX1XFLAGS += -O3

When you install a package, check the output for -On flags. The last one prevails, -O0 means no optimisation.

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented May 29, 2019

I don't recall right now. Might be an accidental fix 😆

Loading

@lock
Copy link

@lock lock bot commented Nov 25, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Loading

@lock lock bot locked and limited conversation to collaborators Nov 25, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet