Use structs/pairs instead of NumericVectors #23

Closed

hadley opened this issue Feb 5, 2016 · 4 comments

hadley commented Feb 5, 2016

I suspect you could get a considerable performance improvement by switching from (e.g)

double euclid(NumericVector p1, NumericVector p2) {
  double retval = 0;
  int n = p1.size();
  for (int i = 0; i < n; i++) {
    retval += pow(p2[i] - p1[i], 2);
  }
  return sqrt(retval);
}

to

struct Point {
  double x, y;
};

double euclid(const Point& p1, const Point& p2) {
  double dx = p1.x - p2.x;
  double dy = p1.y - p2.y;
  return sqrt(dx * dx + dy * dy);
}

Note that you might also be able to drop the square root, by instead squaring the distance that you're comparing against. That would definitely be more efficient than re-squaring the distance later on, as in repel_force().
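
A minimal sketch of that idea, assuming the Point struct above; euclid2, too_close, and the threshold parameter are illustrative names, not anything from ggrepel:

struct Point { double x, y; };  // same Point as above

// Squared Euclidean distance: sufficient for comparisons, no sqrt needed.
double euclid2(const Point& p1, const Point& p2) {
  double dx = p1.x - p2.x;
  double dy = p1.y - p2.y;
  return dx * dx + dy * dy;
}

// Compare against a squared threshold instead of taking a square root.
bool too_close(const Point& p1, const Point& p2, double threshold) {
  return euclid2(p1, p2) < threshold * threshold;
}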

Generally you'll get better performance with C++ data structures than with R ones.

slowkow commented Feb 7, 2016

Hadley, thanks for the tip! Any speed improvement is very much appreciated, and this was an easy one.

I see about a 2.5X speed improvement with C++ data structures (Point, Box) in place of R data structures (NumericVector, NumericMatrix). For comparison, I also tested geom_text instead of geom_text_repel and found that geom_text_repel is about 2.4X slower than geom_text.

geom_text:

Unit: milliseconds
      min       lq     mean   median       uq      max neval
  175.033 278.2243 312.1802 283.0723 383.8128 507.6183    10

optimized, with C++ structs:

Unit: milliseconds
      min       lq     mean   median       uq      max neval
 661.7191 676.8665 680.5037 680.9892 684.1769 699.5232    10

unoptimized, with R data structures (NumericVector, NumericMatrix):

Unit: seconds
      min       lq     mean   median       uq      max neval
 1.578968 1.657331 1.674493 1.676227 1.698223 1.758947    10

I'm using microbenchmark with this code:

library(microbenchmark)
library(ggrepel)
microbenchmark({
  set.seed(42)
  d <- mtcars
  d$name <- rownames(mtcars)
  d <- as.data.frame(rbind(d, d))
  d$wt[33:64] <- d$wt[33:64] + 1
  d$mpg[33:64] <- d$mpg[33:64] + 5
  p <- ggplot(d, aes(wt, mpg, label = name)) +
    geom_point(color = 'red') +
    geom_text_repel() +
    theme_classic(base_size = 16)
  print(p)
}, times = 10L)

I would consider n=64 to be a very large dataset for ggrepel, because it might not be sensible to clutter a plot with so many text labels. With my inefficient O(n^2) algorithm, that's about 2000 * 64 * 64 = 8,192,000 distance and force computations. (See here for discussion about advanced algorithms that are O(log n).)

slowkow added a commit that referenced this issue Feb 7, 2016
Instead of using R data structures like:

    NumericVector
    NumericMatrix

use C++ structures like:

    typedef struct {
      double x, y;
    } Point;

    typedef struct {
      double x1, y1, x2, y2;
    } Box;

The code runs about 2.5X faster on my laptop with this optimization.

This commit addresses issue #23
slowkow closed this as completed Feb 7, 2016

hadley commented Feb 8, 2016

I think the easiest way to avoid the O(n^2) computation is to divide the plot into a grid, and then only check the current and adjacent cells for overlaps. You can either do this roughly by hand (doing e.g. an 8 x 8 grid), or use something more complicated like a quadtree. Alternatively, you could do the full n x n sweep to build up a list of neighbours, but only update it (say) every 100 iterations.
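
A rough sketch of the hand-rolled grid, assuming coordinates already rescaled to [0, 1] and a fixed 8 x 8 grid; the Point struct and the function names are illustrative, not taken from ggrepel:

#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Bucket point indices into an 8 x 8 grid over the unit square, so an
// overlap check only visits a point's own cell and the adjacent cells
// instead of all n points.
const int GRID = 8;

int cell_coord(double v) {
  return std::max(0, std::min(GRID - 1, (int) std::floor(v * GRID)));
}

std::vector<std::vector<int>> build_grid(const std::vector<Point>& pts) {
  std::vector<std::vector<int>> cells(GRID * GRID);
  for (int i = 0; i < (int) pts.size(); i++) {
    cells[cell_coord(pts[i].y) * GRID + cell_coord(pts[i].x)].push_back(i);
  }
  return cells;
}

// Candidate neighbours of point i: everything in the same or an adjacent cell.
std::vector<int> neighbours(int i, const std::vector<Point>& pts,
                            const std::vector<std::vector<int>>& cells) {
  int cx = cell_coord(pts[i].x);
  int cy = cell_coord(pts[i].y);
  std::vector<int> out;
  for (int dy = -1; dy <= 1; dy++) {
    for (int dx = -1; dx <= 1; dx++) {
      int nx = cx + dx, ny = cy + dy;
      if (nx < 0 || nx >= GRID || ny < 0 || ny >= GRID) continue;
      for (int j : cells[ny * GRID + nx]) {
        if (j != i) out.push_back(j);
      }
    }
  }
  return out;
}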

Another thing you could do to make life a little easier would be to parameterise by iteration time (i.e. number of seconds) rather than number of iterations.
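
A minimal sketch of a wall-clock budget with std::chrono; run_for and the step callback are placeholders for the existing iteration loop, not ggrepel's actual API:

#include <chrono>
#include <functional>

// Run one simulation step repeatedly until a wall-clock budget (in seconds)
// is used up, instead of running a fixed number of iterations.
int run_for(double seconds, const std::function<void()>& step) {
  auto start = std::chrono::steady_clock::now();
  int iterations = 0;
  while (std::chrono::duration<double>(
             std::chrono::steady_clock::now() - start).count() < seconds) {
    step();  // placeholder for one round of force/overlap updates
    iterations++;
  }
  return iterations;
}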

slowkow commented Feb 8, 2016

Thanks for the tip about using a grid. I think that is indeed the simplest approach to achieve faster neighbour comparison. Since I expect a small number of labels, I'm skeptical that the performance gain will be worth the implementation complexity. In the future, this may become more important for a geom_jitter alternative (#20) that may have to deal with thousands of points.

I tried marking the euclid function as inline (sketched below), and that seemed to save 82ms or so in the previously mentioned benchmark:

Unit: milliseconds
      min       lq     mean   median       uq      max neval
 567.9925 587.5553 601.3577 598.3028 609.5485 672.4328    25
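
For concreteness, the change amounts to something like this (a sketch, assuming the Point struct from the commit above):

#include <cmath>

struct Point { double x, y; };

// Marking a small, hot function as inline hints the compiler to avoid
// function-call overhead (modern compilers often do this on their own).
inline double euclid(const Point& p1, const Point& p2) {
  double dx = p1.x - p2.x;
  double dy = p1.y - p2.y;
  return std::sqrt(dx * dx + dy * dy);
}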

However, I'm not seeing any improvement on top of that when I avoid the sqrt() call.

I'd be very happy to review pull requests that improve performance! If anyone reading this cares to give it a shot, that would be delightful.

hadley commented Feb 8, 2016

That said, ordering by x might be just as simple, and would probably be faster.
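
A rough sketch of the ordering-by-x idea, with an assumed Point struct and an illustrative interaction radius (neither taken from ggrepel): sort indices by x, then for each point only scan forward while the x-gap stays within the radius:

#include <algorithm>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Collect candidate pairs: after sorting by x, anything further right
// than "radius" cannot overlap, so the inner scan breaks early.
std::vector<std::pair<int, int>> candidate_pairs(const std::vector<Point>& pts,
                                                 double radius) {
  std::vector<int> order(pts.size());
  for (int i = 0; i < (int) order.size(); i++) order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return pts[a].x < pts[b].x; });

  std::vector<std::pair<int, int>> pairs;
  for (size_t a = 0; a < order.size(); a++) {
    for (size_t b = a + 1; b < order.size(); b++) {
      if (pts[order[b]].x - pts[order[a]].x > radius) break;
      pairs.push_back(std::make_pair(order[a], order[b]));
    }
  }
  return pairs;
}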

I'm likely to come back to this in a couple of months - I'm going to be working on gggeom which will provide the data structures and efficient C++ code that will power ggvis.
