Use structs/pairs instead of NumericVectors #23

Closed

hadley opened this issue Feb 5, 2016 · 4 comments

hadley commented Feb 5, 2016

I suspect you could get a considerable performance improvement by switching from (e.g)

double euclid(NumericVector p1, NumericVector p2) {
  double retval = 0;
  int n = p1.size();
  for (int i = 0; i < n; i++) {
    retval += pow(p2[i] - p1[i], 2);
  }
  return sqrt(retval);
}

to

struct Point {
  double x, y;
};

double euclid(const Point& p1, const Point& p2) {
  double dx = p1.x - p2.x;
  double dy = p1.y - p2.y;
  return sqrt(dx * dx + dy * dy);
}

Note that you might also be able to drop the square root, by instead squaring the distance that you're comparing against. That would definitely be more efficient than re-squaring the distance later on, as in repel_force().
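
A minimal sketch of that idea, assuming the Point struct above; euclid2, too_close, and the threshold parameter are illustrative names, not anything from ggrepel:

struct Point { double x, y; };  // same Point as above

// Squared Euclidean distance: sufficient for comparisons, no sqrt needed.
double euclid2(const Point& p1, const Point& p2) {
  double dx = p1.x - p2.x;
  double dy = p1.y - p2.y;
  return dx * dx + dy * dy;
}

// Compare against a squared threshold instead of taking a square root.
bool too_close(const Point& p1, const Point& p2, double threshold) {
  return euclid2(p1, p2) < threshold * threshold;
}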

Generally you'll get better performance with C++ data structures than with R ones.

slowkow commented Feb 7, 2016

Hadley, thanks for the tip! Any speed improvement is very much appreciated, and this was an easy one.

I see about a 2.5X speed improvement with C++ data structures (Point, Box) in place of R data structures (NumericVector, NumericMatrix). For comparison, I also tested geom_text instead of geom_text_repel and found that geom_text_repel is about 2.4X slower than geom_text.

geom_text:

Unit: milliseconds
      min       lq     mean   median       uq      max neval
  175.033 278.2243 312.1802 283.0723 383.8128 507.6183    10

optimized, with C++ structs:

Unit: milliseconds
      min       lq     mean   median       uq      max neval
 661.7191 676.8665 680.5037 680.9892 684.1769 699.5232    10

unoptimized, with R data structures (NumericVector, NumericMatrix):

Unit: seconds
      min       lq     mean   median       uq      max neval
 1.578968 1.657331 1.674493 1.676227 1.698223 1.758947    10

I'm using microbenchmark with this code:

library(microbenchmark)
library(ggrepel)
microbenchmark({
  set.seed(42)
  d <- mtcars
  d$name <- rownames(mtcars)
  d <- as.data.frame(rbind(d, d))
  d$wt[33:64] <- d$wt[33:64] + 1
  d$mpg[33:64] <- d$mpg[33:64] + 5
  p <- ggplot(d, aes(wt, mpg, label = name)) +
    geom_point(color = 'red') +
    geom_text_repel() +
    theme_classic(base_size = 16)
  print(p)
}, times = 10L)

I would consider n=64 to be a very large dataset for ggrepel, because it might not be sensible to clutter a plot with so many text labels. With my inefficient O(n^2) algorithm, that's about 2000 * 64 * 64 = 8,192,000 distance and force computations. (See here for discussion about advanced algorithms that are O(log n).)

slowkow added a commit that referenced this issue Feb 7, 2016
Instead of using R data structures like:

    NumericVector
    NumericMatrix

use C++ structures like:

    typedef struct {
      double x, y;
    } Point;

    typedef struct {
      double x1, y1, x2, y2;
    } Box;

The code runs about 2.5X faster on my laptop with this optimization.

This commit addresses issue #23
slowkow closed this as completed Feb 7, 2016

hadley commented Feb 8, 2016

I think the easiest way to avoid the O(n^2) computation is to divide the plot into a grid, and then only check the current and adjacent cells for overlaps. You can either do this roughly by hand (doing e.g. an 8 x 8 grid), or use something more complicated like a quadtree. Alternatively, you could do the full n x n sweep to build up a list of neighbours, but only update it (say) every 100 iterations.
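
A rough sketch of the hand-rolled grid, assuming coordinates already rescaled to [0, 1] and a fixed 8 x 8 grid; the Point struct and the function names are illustrative, not taken from ggrepel:

#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Bucket point indices into an 8 x 8 grid over the unit square, so an
// overlap check only visits a point's own cell and the adjacent cells
// instead of all n points.
const int GRID = 8;

int cell_coord(double v) {
  return std::max(0, std::min(GRID - 1, (int) std::floor(v * GRID)));
}

std::vector<std::vector<int>> build_grid(const std::vector<Point>& pts) {
  std::vector<std::vector<int>> cells(GRID * GRID);
  for (int i = 0; i < (int) pts.size(); i++) {
    cells[cell_coord(pts[i].y) * GRID + cell_coord(pts[i].x)].push_back(i);
  }
  return cells;
}

// Candidate neighbours of point i: everything in the same or an adjacent cell.
std::vector<int> neighbours(int i, const std::vector<Point>& pts,
                            const std::vector<std::vector<int>>& cells) {
  int cx = cell_coord(pts[i].x);
  int cy = cell_coord(pts[i].y);
  std::vector<int> out;
  for (int dy = -1; dy <= 1; dy++) {
    for (int dx = -1; dx <= 1; dx++) {
      int nx = cx + dx, ny = cy + dy;
      if (nx < 0 || nx >= GRID || ny < 0 || ny >= GRID) continue;
      for (int j : cells[ny * GRID + nx]) {
        if (j != i) out.push_back(j);
      }
    }
  }
  return out;
}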

Another thing you could do to make life a little easier would be to parameterise by iteration time (i.e. number of seconds) rather than number of iterations.
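
A minimal sketch of a wall-clock budget with std::chrono; run_for and the step callback are placeholders for the existing iteration loop, not ggrepel's actual API:

#include <chrono>
#include <functional>

// Run one simulation step repeatedly until a wall-clock budget (in seconds)
// is used up, instead of running a fixed number of iterations.
int run_for(double seconds, const std::function<void()>& step) {
  auto start = std::chrono::steady_clock::now();
  int iterations = 0;
  while (std::chrono::duration<double>(
             std::chrono::steady_clock::now() - start).count() < seconds) {
    step();  // placeholder for one round of force/overlap updates
    iterations++;
  }
  return iterations;
}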

slowkow commented Feb 8, 2016

Thanks for the tip about using a grid. I think that is indeed the simplest approach to achieve faster neighbour comparison. Since I expect a small number of labels, I'm skeptical that the performance gain will be worth the implementation complexity. In the future, this may become more important for a geom_jitter alternative (#20) that may have to deal with thousands of points.

I tried marking the euclid function as inline (sketched below), and that seemed to save 82ms or so in the previously mentioned benchmark:

Unit: milliseconds
      min       lq     mean   median       uq      max neval
 567.9925 587.5553 601.3577 598.3028 609.5485 672.4328    25
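
For concreteness, the change amounts to something like this (a sketch, assuming the Point struct from the commit above):

#include <cmath>

struct Point { double x, y; };

// Marking a small, hot function as inline hints the compiler to avoid
// function-call overhead (modern compilers often do this on their own).
inline double euclid(const Point& p1, const Point& p2) {
  double dx = p1.x - p2.x;
  double dy = p1.y - p2.y;
  return std::sqrt(dx * dx + dy * dy);
}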

However, I'm not seeing any improvement on top of that when I avoid the sqrt() call.

I'd be very happy to review pull requests that improve performance! If anyone reading this cares to give it a shot, that would be delightful.

hadley commented Feb 8, 2016

That said, ordering by x might be just as simple, and would probably be faster.
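
A rough sketch of the ordering-by-x idea, with an assumed Point struct and an illustrative interaction radius (neither taken from ggrepel): sort indices by x, then for each point only scan forward while the x-gap stays within the radius:

#include <algorithm>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Collect candidate pairs: after sorting by x, anything further right
// than "radius" cannot overlap, so the inner scan breaks early.
std::vector<std::pair<int, int>> candidate_pairs(const std::vector<Point>& pts,
                                                 double radius) {
  std::vector<int> order(pts.size());
  for (int i = 0; i < (int) order.size(); i++) order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return pts[a].x < pts[b].x; });

  std::vector<std::pair<int, int>> pairs;
  for (size_t a = 0; a < order.size(); a++) {
    for (size_t b = a + 1; b < order.size(); b++) {
      if (pts[order[b]].x - pts[order[a]].x > radius) break;
      pairs.push_back(std::make_pair(order[a], order[b]));
    }
  }
  return pairs;
}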

I'm likely to come back to this in a couple of months - I'm going to be working on gggeom which will provide the data structures and efficient C++ code that will power ggvis.
