Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

summarise shouldn't force base R functions like sum and mean #3255

Closed
msberends opened this issue Dec 19, 2017 · 20 comments
Closed

summarise shouldn't force base R functions like sum and mean #3255

msberends opened this issue Dec 19, 2017 · 20 comments
Assignees
Labels

Comments

@msberends
Copy link

@msberends msberends commented Dec 19, 2017

The summarise function forces the use of base R functions, instead of the expected behaviour that it would use loaded ('active') functions from the own environment.

# (from another package:)
#' @inherit base::mean
#' @export
mean <- function(x, ..., na.rm = TRUE) {
  base::mean(x, ..., na.rm = na.rm)
}

library(dplyr)
df <- tibble(group = LETTERS[1:10], 
             score = c(runif(3, 0, 1), 
                       NA,
                       runif(6, 0, 1)))

mean(df$score) # now na.rm = TRUE by default
# [1] 0.5479927

df %>% summarise(m = mean(score))
# # A tibble: 1 x 1
#       m
#   <dbl>
# 1    NA

> df %>% summarise(m = mean(score, na.rm = TRUE))
# # A tibble: 1 x 1
#       m
#   <dbl>
# 1    0.5479927

Why not use functions from ones own environment? This mean function comes from another package, which I deliberately loaded, so I would expect it to work in all my code. But apparently, the summarise function is an exception. I think you should not force the use of base R functions like base::mean in dplyr.

You might think that package developers just shouldn't overwrite functions like mean, but even dplyr does this (like intersect, setdiff, setequal, union). That too gives other results than when only base R was loaded, just like you would expect. Hence, you loaded dplyr. Now I ask you to let summarise use the latest loaded functions, not just base R. Another example: if I would write a function to overwrite sum to solve #3189, it wouldn't be used by summarise at default, even when my package was loaded after dplyr.

@JohnMount
Copy link

@JohnMount JohnMount commented Dec 19, 2017

I don't know if this helps, but it appears you can pass in arbitrary functions as long as they don't have an already used name (such as mean()). I have no idea what the list of non-overidable functions is though.

suppressPackageStartupMessages(library(dplyr))

mean <- function(x, ..., na.rm = TRUE) {
  base::mean(x, ..., na.rm = na.rm)
}

mean2 <- mean

df <- tibble(score = c(1, NA, 3))


df %>% summarise(m = mean(score))
#> # A tibble: 1 x 1
#>       m
#>   <dbl>
#> 1    NA
df %>% summarise_at(, .vars="score", .funs = funs(mean))
#> # A tibble: 1 x 1
#>   score
#>   <dbl>
#> 1    NA

df %>% summarise(m = mean2(score))
#> # A tibble: 1 x 1
#>       m
#>   <dbl>
#> 1  2.00
df %>% summarise_at(, .vars="score", .funs = funs(mean2))
#> # A tibble: 1 x 1
#>   score
#>   <dbl>
#> 1  2.00

@msberends
Copy link
Author

@msberends msberends commented Dec 24, 2017

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

Suddenly I thought of summarise_at, hoping that it would be the solution, but it isn't;

df %>% summarise_at(vars(score), mean)
# # A tibble: 1 x 1
#   score
#   <dbl>
# 1    NA

df %>% summarise_at(vars(score), funs(mean, .args = list(na.rm = TRUE)))
# # A tibble: 1 x 1
#       score
#       <dbl>
# 1 0.5479927

This of course works, but isn't really a solution or explanation too:

df %>% summarise(m = MyPackage::mean(score))
# # A tibble: 1 x 1
#           m
#       <dbl>
# 1 0.5479927

@lionel-
Copy link
Member

@lionel- lionel- commented Dec 24, 2017

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

That's because of hybrid evaluation, we use C++ implementations to speed up computations. Maybe we could check that the symbol mean evaluates to the mean function from the base package and only enable hybrid evaluation in that case.

@msberends
Copy link
Author

@msberends msberends commented Dec 25, 2017

@msberends
Copy link
Author

@msberends msberends commented Jan 4, 2018

we use C++ implementations to speed up computations

I got that now, after reading http://r-pkgs.had.co.nz/src.html (which is a great intro btw).
Maybe you could edit the wrapper functions like:

sum <- function(x) {
  .Call('sum', PACKAGE = 'dplyr', x)
}

To:

sum <- function(x) {
  if (identical(sum, base::sum)) {
    .Call('sum', PACKAGE = 'dplyr', x)
  } else {
    sum
  }
}

Something like this? Or create a helper function for all cases?

@lionel-
Copy link
Member

@lionel- lionel- commented Jan 4, 2018

Hybrid evaluation in dplyr doesn't work by masking so that wouldn't work.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 17, 2018

Agree to check if mean evaluates to base::mean before kicking in hybrid evaluation, as suggested by @lionel-.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

Is this then going to be even more expensive than it is already ?

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 13, 2018

This check would be run only once, not once per group.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

Sure. So there are these :

void install_simple_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("mean") ] = simple_prototype<dplyr::Mean>;
  handlers[ Rf_install("var") ] = simple_prototype<dplyr::Var>;
  handlers[ Rf_install("sd") ] = simple_prototype<dplyr::Sd>;
  handlers[ Rf_install("sum") ] = simple_prototype<dplyr::Sum>;
}
void install_minmax_handlers(HybridHandlerMap& handlers) {
  handlers[Rf_install("min")] = minmax_prototype<true>;
  handlers[Rf_install("max")] = minmax_prototype<false>;
}
void install_in_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("%in%") ] = in_prototype;
}

Do we need the same for dplyr specific things, like these below, in case e.g. we want to "mask" row_number for example ?

void install_count_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("n") ] = count_prototype;
  handlers[ Rf_install("n_distinct") ] = count_distinct_prototype;
}
void install_nth_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("first") ] = first_prototype;
  handlers[ Rf_install("last") ] = last_prototype;
  handlers[ Rf_install("nth") ] = nth_prototype;
}
void install_window_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("row_number") ] = row_number_prototype;
  handlers[ Rf_install("ntile") ] = ntile_prototype;
  handlers[ Rf_install("min_rank") ] = rank_impl_prototype<dplyr::internal::min_rank_increment>;
  handlers[ Rf_install("percent_rank") ] = rank_impl_prototype<dplyr::internal::percent_rank_increment>;
  handlers[ Rf_install("dense_rank") ] = rank_impl_prototype<dplyr::internal::dense_rank_increment>;
  handlers[ Rf_install("cume_dist") ] = rank_impl_prototype<dplyr::internal::cume_dist_increment>;
}
void install_offset_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("lead") ] = leadlag_prototype<Lead>;
  handlers[ Rf_install("lag") ] = leadlag_prototype<Lag>;
}

I guess these don't need to change

void install_debug_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("verify_hybrid") ] = verify_hybrid_prototype;
  handlers[ Rf_install("verify_not_hybrid") ] = verify_not_hybrid_prototype;
}

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

var and sd come from stats, not base

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 13, 2018

I don't see why we should be changing any of this code. I thought we would test that evaluating the function's symbol in the user's environment matches what we expect (base, stats, ...).?

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

Sure, I was dumping this here to show which are dealt with by hybrid.
We don't have a mechanism right now to capture "what we expect"

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

Maybe we can evaluate the symbol, and check if it is in either stats base or dplyr as these are the only 3 used

@lionel-
Copy link
Member

@lionel- lionel- commented Mar 13, 2018

I would rather compare the pointers with objects picked up from the namespaces. Perhaps it makes sense to cache these in a static.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 13, 2018

Do we need to change #3420 to account for the new behavior?

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

Yes that's better. We already have this static :

HybridHandlerMap& get_handlers() {
  static HybridHandlerMap handlers;

maybe HybridHandlerMap can be updated from

typedef dplyr_hash_map<SEXP, HybridHandler> HybridHandlerMap;

to something like

typedef dplyr_hash_map<SEXP, std::pair<HybridHandler,SEXP>> HybridHandlerMap;

so that we would do something like this pseudo code:

handlers[ Rf_install("mean") ] = make_pair( simple_prototype<dplyr::Mean>, base::mean )  ;

otherwise we can have another map for the environments, maybe that's less disruptive

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 13, 2018

Instead of std::pair can we add a new class HybridHandler that includes the current typedef and the SEXP as members? I'd rather avoid two maps.

@romainfrancois romainfrancois self-assigned this Mar 13, 2018
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 13, 2018

@krlmlr I see this going just before if (TYPEOF(fun_symbol) != SYMSXP) { so that's orthogonal to #3420 ... mhmm maybe not. anyway, I'll do this with ☕️.

Alright, I'll do this then.

@krlmlr krlmlr closed this in #3446 Mar 21, 2018
krlmlr added a commit that referenced this issue Mar 21, 2018
- Hybrid functions can now be masked by regular R functions to turn off hybrid evaluation (#3255).
@lock
Copy link

@lock lock bot commented Sep 17, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 17, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

5 participants