Description
I just learned that working with NaNs of type long double comes with a 250x performance penalty on my system (Ubuntu, recent Intel CPU):
x_na <- rep(NA_real_, 1e5)
x_zero <- rep(0, 1e5)
options(digits = 3)
microbenchmark::microbenchmark(
sum(x_na),
sum(x_zero),
any(is.na(x_na)),
any(is.na(x_zero))
)
#> Unit: microseconds
#>                expr     min      lq    mean  median    uq   max neval cld
#>           sum(x_na) 23102.6 23536.9 24134.3 23926.6 24522 29359   100   b
#>         sum(x_zero)    86.7    89.4    99.3    93.6   103   185   100  a
#>    any(is.na(x_na))    97.5   143.9   240.2   151.6   165  2731   100  a
#>  any(is.na(x_zero))   179.4   210.1   262.5   219.0   234  2120   100  a

Created on 2018-01-06 by the reprex package (v0.1.1.9000).
On an old iMac I'm still seeing a 70x slowdown.
This is important because a user may inadvertently create a column full of NA values and then try to aggregate it.
This also affects the hybrid handlers in dplyr. I'm using GCC on Ubuntu, and I have tried various compiler switches to no avail. A cursory web search hasn't turned up anything useful: https://www.startpage.com/do/search?query=r+sum+vector+na+slow.
Adding a value to a long double NaN and storing the result just seems slow (code). Reference: https://randomascii.wordpress.com/2012/04/21/exceptional-floating-point/.
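For reference, the hot loop behaves essentially like the C sketch below (a simplification I wrote for illustration, not the actual R source): once one element is NA (R's NA_real_ is a NaN payload), the accumulator stays NaN, so every remaining addition is a long double NaN operation, which is what hits the slow path on x87 hardware.

```c
#include <math.h>

/* Sketch of a sum loop with a long double accumulator, as R's sum()
 * uses internally (simplified; not the actual R source). Once `acc`
 * becomes NaN it stays NaN, and on x87 hardware every further
 * `acc += x[i]` is a NaN operation handled via a slow assist path. */
double naive_sum(const double *x, long n) {
    long double acc = 0.0L;
    for (long i = 0; i < n; i++)
        acc += x[i];
    return (double) acc;
}
```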
Can you replicate? Should our hybrid mean() and sum() check for NA every 1000 iterations or so?
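The periodic check could look something like this (a hypothetical sketch, not dplyr's actual hybrid evaluator): sum in blocks of 1000 and test the accumulator for NaN between blocks, so a vector full of NA pays the NaN-arithmetic penalty for at most one block instead of the whole length.

```c
#include <math.h>

#define CHECK_INTERVAL 1000  /* test the accumulator every 1000 elements */

/* Hypothetical early-exit sum: same long double accumulation, but the
 * accumulator is checked for NaN after each block of CHECK_INTERVAL
 * additions. Once it is NaN the final result is fixed, so we can
 * return immediately instead of grinding through the rest. */
double sum_with_na_check(const double *x, long n) {
    long double acc = 0.0L;
    long i = 0;
    while (i < n) {
        long end = i + CHECK_INTERVAL < n ? i + CHECK_INTERVAL : n;
        for (; i < end; i++)
            acc += x[i];
        if (isnan((double) acc))
            return NAN;  /* result is already NA/NaN; stop early */
    }
    return (double) acc;
}
```

The interval is a trade-off: the check itself is nearly free, but too small a block would add branch overhead to the common all-finite case.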