
Aggregating vectors with NA values very slow #3288

Closed
krlmlr opened this issue Jan 6, 2018 · 8 comments
krlmlr (Member) commented Jan 6, 2018

I just learned that working with NaNs of type long double comes with a 250x performance penalty on my system (Ubuntu, recent Intel CPU):

x_na <- rep(NA_real_, 1e5)
x_zero <- rep(0, 1e5)

options(digits = 3)
microbenchmark::microbenchmark(
  sum(x_na),
  sum(x_zero),
  any(is.na(x_na)),
  any(is.na(x_zero))
)
#> Unit: microseconds
#>                expr     min      lq    mean  median    uq   max neval cld
#>           sum(x_na) 23102.6 23536.9 24134.3 23926.6 24522 29359   100   b
#>         sum(x_zero)    86.7    89.4    99.3    93.6   103   185   100  a 
#>    any(is.na(x_na))    97.5   143.9   240.2   151.6   165  2731   100  a 
#>  any(is.na(x_zero))   179.4   210.1   262.5   219.0   234  2120   100  a

Created on 2018-01-06 by the reprex package (v0.1.1.9000).

On an old iMac I'm still seeing a 70x slowdown.

This is important because a user may inadvertently create a column full of NA values and then try to aggregate it.

This also affects the hybrid handlers in dplyr. I'm using GCC on Ubuntu and have tried various compiler switches, to no avail. A coarse search hasn't turned up anything useful: https://www.startpage.com/do/search?query=r+sum+vector+na+slow.

Adding a value to a long double NaN and storing the result just seems slow (code). Reference: https://randomascii.wordpress.com/2012/04/21/exceptional-floating-point/.
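The propagation behavior that makes the unchecked loop correct (if slow) can be illustrated with a minimal sketch in plain C++, using a quiet NaN as a stand-in for R's NA_real_ (which is a NaN with a particular payload):

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Sum a double vector through a long double accumulator, the way R's
// sum() does for doubles, with no NA check at all. Once a NaN enters
// the accumulator it propagates through every later addition, so the
// final result is NaN -- correct, but each += on a NaN operand can be
// dramatically slower on some hardware.
long double sum_unchecked(const std::vector<double>& x) {
    long double res = 0.0L;
    for (double v : x) {
        res += v;
    }
    return res;
}
```

Calling this on the C-level equivalent of rep(NA_real_, 1e5) (a vector of NaNs) returns NaN; the cost is in the repeated NaN arithmetic, not in producing the answer.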

Can you replicate? Should our hybrid mean() and sum() check for NA every 1000 iterations or so?

jimhester (Member) commented Jan 6, 2018

R's implementation of sum() is just very inefficient for this case. In the integer case it exits early when there is an NA, but there is no similar code for reals, for some reason unknown to me.

x_na <- rep(NA_real_, 1e5)
x_int_na <- rep(NA_integer_, 1e5)
x_zero <- rep(0, 1e5)
x_int_zero <- rep(0L, 1e5)

options(digits = 3)
microbenchmark::microbenchmark(
  sum(x_na),
  sum(x_int_na),
  sum(x_zero),
  sum(x_int_zero),
  any(is.na(x_na)),
  any(is.na(x_zero))
)
#> Unit: nanoseconds
#>                expr      min       lq     mean   median       uq      max
#>           sum(x_na) 13912375 14753496 15685305 15594234 16373233 20977510
#>       sum(x_int_na)      165      406     1276      849     1582     5920
#>         sum(x_zero)    81728    87078   106021    99565   117045   198983
#>     sum(x_int_zero)    81739    88401   107228    96394   123177   212391
#>    any(is.na(x_na))    49145    74900   290212   211006   262992  3876202
#>  any(is.na(x_zero))   149602   213617   365284   325532   382149  3258666

Created on 2018-01-05 by the reprex package (v0.1.1.9000).

krlmlr (Member, Author) commented Jan 6, 2018

The R implementation is optimized for the case where there are no NAs: when the vector is clean, it's faster not to check, and since NA + x == NA the result is still correct. But we should review this given that long double operations on NaNs seem to be that expensive. @HenrikBengtsson suggested looking at matrixStats::sum2(): https://github.com/HenrikBengtsson/matrixStats/blob/b30dcd1696bc604e27c4dfee83ab6f86e5e68705/src/sum2_lowlevel_template.h#L37-L52. I wonder if the compiler is smart enough to turn that construct into a nested loop. http://www.jottr.org/2015/06/RCheckUserInterrupt.html suggests that the checkpoint interval should be a power of two.
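A checkpointed loop of the kind discussed here might look like the following sketch (plain C++, with std::isnan standing in for R's NA check; the 2^10 block size is an arbitrary assumption, not a measured optimum):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum a double vector, checking the running total for NaN once per
// block of 1024 (2^10) elements. If a NaN has entered the accumulator,
// the final result is NaN no matter what follows, so we can stop early
// and skip the slow NaN arithmetic on the rest of the vector.
long double blocked_sum(const std::vector<double>& x) {
    const std::size_t block = 1024;  // power of two, per the jottr.org note
    long double res = 0.0L;
    for (std::size_t i = 0; i < x.size(); ) {
        std::size_t end = std::min(i + block, x.size());
        for (; i < end; ++i) {
            res += x[i];  // inner loop stays branch-free
        }
        if (std::isnan(static_cast<double>(res))) {
            return res;  // early exit: the sum is already NaN
        }
    }
    return res;
}
```

The inner loop stays tight, so the clean-vector case pays only one check per block, while an all-NA vector bails out after at most one block.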

jimhester (Member) commented Jan 6, 2018

It does still check for NaN on every addition when na.rm = TRUE (https://github.com/wch/r-source/blob/bcd26fdf136083c50db708d03d6d98b3735ce0dd/src/main/summary.c#L133), and this has basically no impact on performance when there are no NaNs (due to branch prediction). A 50/50 mix of NAs has the same performance characteristics as all NAs when na.rm = FALSE, and is predictably slower when na.rm = TRUE.

x_na <- rep(NA_real_, 1e5)
x_int_na <- rep(NA_integer_, 1e5)
x_zero <- rep(0, 1e5)
x_int_zero <- rep(0L, 1e5)
x_mixed <- sample(c(rep(NA_real_, 5e4), rep(0, 5e4)))

options(digits = 3)
microbenchmark::microbenchmark(
  sum(x_na),
  sum(x_na, na.rm = TRUE),
  sum(x_mixed),
  sum(x_mixed, na.rm = TRUE),
  sum(x_zero),
  sum(x_zero, na.rm = TRUE),
  any(is.na(x_na)),
  any(is.na(x_zero)),
  sum(x_int_na)
)
#> Unit: nanoseconds
#>                        expr      min       lq     mean   median       uq
#>                   sum(x_na) 13779153 14759008 15799532 15727452 16834953
#>     sum(x_na, na.rm = TRUE)    41465    51013    62473    58393    68518
#>                sum(x_mixed) 13159144 13513737 14851562 14426681 15797568
#>  sum(x_mixed, na.rm = TRUE)   388371   401345   464328   432950   491894
#>                 sum(x_zero)    81766    87021   102548    97362   112198
#>   sum(x_zero, na.rm = TRUE)    82082    86482   104186    94006   114986
#>            any(is.na(x_na))    47789    66788   240891   215108   271170
#>          any(is.na(x_zero))   153429   205304   363503   310066   383792
#>               sum(x_int_na)      176      630     1399     1189     1738

Created on 2018-01-06 by the reprex package (v0.1.1.9000).

Also, 1048576 is 2^20; you can see that the assembly produced is equivalent to that for other powers of 2: https://godbolt.org/g/kLxJtf.
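The reason powers of two are attractive as checkpoint intervals is that the compiler can lower the modulo to a bitwise AND (for unsigned operands). A small illustration of the identity involved:

```cpp
#include <cstdint>

// For unsigned n and a power-of-two divisor d, n % d == n & (d - 1).
// This is what lets the compiler turn "i % 1048576 == 0" into a cheap
// AND-plus-compare instead of an integer division.
std::uint64_t mod_pow2(std::uint64_t n, std::uint64_t d) {
    // precondition: d is a power of two
    return n & (d - 1);
}
```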

krlmlr (Member, Author) commented Jan 6, 2018

Yes, na.rm = TRUE is not the problem here. I guess what happens is that the check for narm is done outside the loop. The NA check is reasonably fast due to branch prediction, but not free either; I have an experimental branch in dplyr to confirm this and can show plots later if you're interested. (I don't think we want this in dplyr, because we can usually expect no NAs when na.rm = FALSE.)

romainfrancois (Member) commented May 29, 2018

If checking for NaN is not expensive, as @jimhester says, perhaps we can revise Sum<> to exit early:

// special case for REALSXP because it treats NA differently
template <bool NA_RM, typename Index>
struct Sum<REALSXP, NA_RM, Index> {
  static double process(double* ptr,  const Index& indices) {
    long double res = 0;
    int n = indices.size();
    for (int i = 0; i < n; i++) {
      double value = ptr[indices[i]];

      // !NA_RM: we don't test for NA here because += NA will give NA
      // this is faster in the most common case where there are no NA
      // if there are NA, we could return quicker as in the version for
      // INTSXP, but we would penalize the most common case
      if (NA_RM && Rcpp::traits::is_na<REALSXP>(value)) {
        continue;
      }

      res += value;
    }

    return (double)res;
  }
};
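An early-exit revision along these lines might look like the following sketch (plain C++ on a raw double array, with std::isnan standing in for Rcpp::traits::is_na<REALSXP>, and the index abstraction dropped for brevity; NA_RM remains a template parameter as in the original):

```cpp
#include <cmath>
#include <cstddef>

// Sketch of an early-exit Sum for doubles: with NA_RM == false, the
// first NaN encountered determines the result, so return it immediately
// instead of dragging a NaN through the accumulator for the rest of
// the vector.
template <bool NA_RM>
double sum_real(const double* ptr, std::size_t n) {
    long double res = 0;
    for (std::size_t i = 0; i < n; i++) {
        double value = ptr[i];
        if (std::isnan(value)) {   // stand-in for the NA check
            if (NA_RM) continue;   // na.rm = TRUE: skip the value
            return value;          // na.rm = FALSE: result is NA, exit now
        }
        res += value;
    }
    return static_cast<double>(res);
}
```

Compared with the version above, the clean-vector case now pays one well-predicted branch per element, while an NA-containing vector with na.rm = FALSE returns after the first NA.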

krlmlr (Member, Author) commented May 29, 2018

We could. We usually expect no NA values when na.rm = FALSE, so it doesn't seem useful to optimize for this case. Checking costs about 2% (according to Jim's measurements, 102 → 104), but not checking comes with a penalty of 25000% if there happen to be NA values in the vector. I wonder if we can bring those 2% down even further by checking NA-ness every x iterations, or whether we just pay this cost to make the unimportant (but still possible) case faster.

romainfrancois (Member) commented May 29, 2018

That would add complexity, I think, and if we want performance we can always make the sum parallel. I'll update the code so that we pay for the test and exit early.

lock bot commented Nov 26, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 26, 2018