Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite min and max hybrid handlers #2436

Merged
merged 7 commits into from Feb 20, 2017
Merged

Conversation

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 19, 2017

Now handled by a single template class, with support for all corner cases (empty vector, NA values, ...). For integer input, NA is returned instead of Inf.

Fixes #2305.

@hadley: PTAL.

@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2017

How hard would it be expose this directly as an R function? I'd be interested in a benchmark vs base R and the previous version.

Maybe having four templated alternatives (missing yes/no + int/double) would be faster and lead to simple code (albeit with a small amount of duplicated code)

Loading

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 19, 2017

It's already there: with_hybrid(max(a), a = 1:5) vs. without_hybrid((max)(a), a = 1:5). (Just fixed a few problems in master.)

Current master

library(dplyr, warn.conflicts = FALSE)
set.seed(123)
N <- 1e7
x <- runif(N)
xx <- x
xx[N/2] <- NA

microbenchmark::microbenchmark(
  dplyr:::with_hybrid(max(x), x),
  dplyr:::without_hybrid((max)(x), x),
  dplyr:::with_hybrid(max(xx), xx),
  dplyr:::without_hybrid((max)(xx), xx),
  dplyr:::with_hybrid(max(xx, na.rm = TRUE), xx),
  dplyr:::without_hybrid((max)(xx, na.rm = TRUE), xx),
  times = 20
)
#> Unit: milliseconds
#>                                                 expr      min       lq
#>                       dplyr:::with_hybrid(max(x), x) 55.48891 57.39561
#>                  dplyr:::without_hybrid((max)(x), x) 13.54115 14.00521
#>                     dplyr:::with_hybrid(max(xx), xx) 27.96920 29.19869
#>                dplyr:::without_hybrid((max)(xx), xx) 13.69948 14.61733
#>       dplyr:::with_hybrid(max(xx, na.rm = TRUE), xx) 55.02973 55.85497
#>  dplyr:::without_hybrid((max)(xx, na.rm = TRUE), xx) 13.81365 14.53460
#>      mean   median       uq      max neval cld
#>  60.72320 60.20266 63.49677 67.44705    20   c
#>  16.04345 15.21578 17.77672 22.53380    20 a  
#>  31.27390 30.79920 31.78185 43.49719    20  b 
#>  15.47042 15.32329 15.60884 20.75653    20 a  
#>  61.45400 58.53612 65.52276 77.96244    20   c
#>  18.34514 15.17433 17.60567 43.77499    20 a

This PR

#> Unit: milliseconds
#>                                                 expr      min       lq
#>                       dplyr:::with_hybrid(max(x), x) 60.63637 61.25377
#>                  dplyr:::without_hybrid((max)(x), x) 13.55323 14.14427
#>                     dplyr:::with_hybrid(max(xx), xx) 31.07457 31.26109
#>                dplyr:::without_hybrid((max)(xx), xx) 13.41300 13.94740
#>       dplyr:::with_hybrid(max(xx, na.rm = TRUE), xx) 54.90483 55.64433
#>  dplyr:::without_hybrid((max)(xx, na.rm = TRUE), xx) 13.36041 13.85364
#>      mean   median       uq      max neval  cld
#>  62.70054 62.29523 64.12417 66.02901    20    d
#>  14.92565 14.73436 15.01480 21.73895    20 a   
#>  32.84819 32.19134 32.93270 45.22439    20  b  
#>  14.28596 14.29246 14.68599 14.98137    20 a   
#>  57.13864 56.20387 58.92122 59.57113    20   c 
#>  14.43992 14.24315 14.71131 17.95994    20 a

The na.rm = FALSE case has become a tad slower with this change (with gcc -O2 on my laptop), overall our implementation is slower than base R's by a factor of ~3-4. However, base R doesn't seem to have a shortcut for na.rm = FALSE

I combined the four classes into one to reduce maintenance effort (now and in the future). I'd argue that a minimal run time improvement isn't worth the extra maintenance effort, given that we're already sub-par. I could look at a profiler run if it's important, though.

Loading

@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2017

If the hybrid is so much slower, why do we even have it?

Loading

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 19, 2017

It's faster in a grouped query, because we avoid the roundtrip to the R interpreter and the materialization of the groupwise column values. In other words, we can do max(x[group_idx]) without materializing x[group_idx].

Loading

@krlmlr krlmlr requested a review from hadley Feb 20, 2017
@hadley
Copy link
Member

@hadley hadley commented Feb 20, 2017

Base R is pretty inconsistent:

typeof(median(FALSE))
#> [1] "logical"
typeof(median(c(FALSE, TRUE)))
#> [1] "double"

typeof(median(1L))
#> [1] "integer"
typeof(median(c(1L, 2L)))
#> [1] "double"

I think we should say that the hybrid median always returns a double.

Loading

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

This PR only touches hybrid min and max.

Loading

@hadley
Copy link
Member

@hadley hadley commented Feb 20, 2017

Oops. The equivalent base R code is this:

typeof(min(1L, na.rm = TRUE))
#> [1] "integer"
typeof(min(NA_integer_, na.rm = TRUE))
#> Warning in min(NA_integer_, na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> [1] "double"

typeof(min(TRUE, na.rm = TRUE))
#> [1] "integer"
typeof(min(NA, na.rm = TRUE))
#> Warning in min(NA, na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> [1] "double"

Seems like it would be more consistent to always return a numeric.

Loading

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

Now always returning numeric. New times:

#> Unit: milliseconds
#>                                                 expr      min       lq
#>                       dplyr:::with_hybrid(max(x), x) 57.92209 58.49885
#>                  dplyr:::without_hybrid((max)(x), x) 13.72525 13.98664
#>                     dplyr:::with_hybrid(max(xx), xx) 29.64903 30.68764
#>                dplyr:::without_hybrid((max)(xx), xx) 13.82268 14.15346
#>       dplyr:::with_hybrid(max(xx, na.rm = TRUE), xx) 55.57600 55.79007
#>  dplyr:::without_hybrid((max)(xx, na.rm = TRUE), xx) 13.72621 14.23667
#>      mean   median       uq      max neval  cld
#>  60.72233 59.46146 62.08578 67.16394    20    d
#>  14.77884 14.19997 14.55969 22.61328    20 a   
#>  33.49701 32.23387 33.76267 60.47334    20  b  
#>  15.08331 14.86969 15.44266 18.15982    20 a   
#>  57.59341 56.32895 58.95452 63.45070    20   c 
#>  14.88210 14.83992 15.22170 17.00810    20 a

Loading

hadley
hadley approved these changes Feb 20, 2017
@krlmlr krlmlr merged commit 692480c into tidyverse:master Feb 20, 2017
1 of 2 checks passed
Loading
@krlmlr krlmlr deleted the b-#2305-min-max-inf branch Feb 20, 2017
@lock
Copy link

@lock lock bot commented Jan 18, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Loading

@lock lock bot locked and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants