
Aggregating vectors with NA values very slow #3288

@krlmlr

Description


I just learned that working with NaNs of type long double comes with a 250x performance penalty on my system (Ubuntu, recent Intel CPU):

x_na <- rep(NA_real_, 1e5)
x_zero <- rep(0, 1e5)

options(digits = 3)
microbenchmark::microbenchmark(
  sum(x_na),
  sum(x_zero),
  any(is.na(x_na)),
  any(is.na(x_zero))
)
#> Unit: microseconds
#>                expr     min      lq    mean  median    uq   max neval cld
#>           sum(x_na) 23102.6 23536.9 24134.3 23926.6 24522 29359   100   b
#>         sum(x_zero)    86.7    89.4    99.3    93.6   103   185   100  a 
#>    any(is.na(x_na))    97.5   143.9   240.2   151.6   165  2731   100  a 
#>  any(is.na(x_zero))   179.4   210.1   262.5   219.0   234  2120   100  a

Created on 2018-01-06 by the reprex package (v0.1.1.9000).

On an old iMac I'm still seeing a 70x slowdown.

This is important because a user may inadvertently create a column full of NA values and then try to aggregate it.

This also affects the hybrid handlers in dplyr. I'm using GCC on Ubuntu, and I have tried various compiler switches to no avail. A coarse search hasn't found anything useful: https://www.startpage.com/do/search?query=r+sum+vector+na+slow.

Adding a value to a long double NaN and storing the result just seems slow (code). Reference: https://randomascii.wordpress.com/2012/04/21/exceptional-floating-point/.

Can you replicate? Should our hybrid mean() and sum() check for NA every 1000 iterations or so?
