Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using n() in nested mutate()/summarize() calls gives unexpected results #2080

Closed
mwillumz opened this issue Aug 19, 2016 · 11 comments
Closed
Labels
bug
Milestone

Comments

@mwillumz
Copy link

@mwillumz mwillumz commented Aug 19, 2016

When transitioning from by_row() to map() approach I've found that several dplyr/purrr/tidyr functions do not evaluate within the map() environment. For instance below I was expecting the value returned by n() in the map() example to match that of the by_row() version. Instead it returns the number of rows of the nested input Temp. This might be intended but I can't think of an obvious way to use dplyr::n() on nested tibbles via map().

library(dplyr); library(purrr); library(tidyr)
data(iris)

Temp <- iris %>%
  group_by(Species) %>%
  nest()

ByRow <- by_row(Temp, function(x){
  x$data[[1]] %>%
    filter(Sepal.Length >= 6) %>%
    summarise(petal_length_avg = mean(Petal.Length),
              obs = n())
}, .to = 'test')

ByRow$test

Map <- mutate(Temp, test = map(data, . %>% 
                                 filter(Sepal.Length >= 6) %>%
                                 summarise(petal_length_avg = mean(Petal.Length),
                                              obs = n())))
Map$test
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Nov 7, 2016

Thanks. What is the expected result? Does it work better with #2190?

devtools::install_github("hadley/dplyr#2190")

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 8, 2016

@hadley: I don't understand the issue here.

@hadley
Copy link
Member

@hadley hadley commented Dec 8, 2016

I think this is a better example:

library(dplyr)
library(purrr)

df <- tibble(x = list(
  tibble(y = 1:2),
  tibble(y = 1:3),
  tibble(y = 1:4)
))

nrows <- function(df) {
  df %>% summarise(n = n()) %>% .[["n"]]
}

df %>%
  mutate(
    n1 = x %>% map_int(nrows),
    n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
  )
#> # A tibble: 3 × 3
#>                  x    n1    n2
#>             <list> <int> <int>
#> 1 <tibble [2 × 1]>     2     3
#> 2 <tibble [3 × 1]>     3     3
#> 3 <tibble [4 × 1]>     4     3

n1 and n2 should be equivalent, but aren't, presumably because dplyr and purrr (or magrittr?) are fighting over what . means,

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 8, 2016

Same behavior with #2190.

@krlmlr krlmlr added the bug label Dec 8, 2016
@krlmlr krlmlr self-assigned this Mar 21, 2017
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 22, 2017

This looks like a too eager substitution by hybrid evaluation. It walks the expression, encounters n() and replaces it with the number of rows:

df %>%
  mutate(
    n1 = x %>% map_int(nrows),
    n2 = x %>% map_int(. %>% summarise(n = 3) %>% .[["n"]])
  )

I don't know how to fix this, short of disabling hybrid evaluation if the expression cannot be evaluated in full by the hybrid evaluator. But this will break e.g. df %>% group_by(id) %>% summarize(n() - 1), unless we implement a regular version of n().

@hadley: Please advise.

@hadley
Copy link
Member

@hadley hadley commented Mar 22, 2017

Ah of course. In that case, n3 and n4 below are more worrying:

df %>%
  mutate(
    n1 = x %>% map_int(nrows),
    n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
    n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
    n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
  )
#> # A tibble: 3 × 5
#>                  x    n1    n2    n3    n4
#>             <list> <int> <int> <int> <int>
#> 1 <tibble [2 × 1]>     2     3     3     3
#> 2 <tibble [3 × 1]>     3     3     3     3
#> 3 <tibble [4 × 1]>     4     3     3     3

I think we can probably leave off resolving this until the next version? I suspect it will require more re-thinking about how the hybrid evaluator works.

@SimonCoulombe
Copy link

@SimonCoulombe SimonCoulombe commented Jan 25, 2018

Adding to this issue, is there any reason why output1 is broken, but not output2 or output3 ?

library(tidyverse)
temp <- data_frame(id = 1:5)
 
temp$data <- 1:5 %>% map(~{
   set.seed(.x)
   data_frame(group = sample(1:2, size = .x, replace = TRUE))
 })
temp

### output1 returns n= 5 everywhere
output1 <- temp %>% mutate(dd = map(data, function(X) {
   X %>% 
     group_by(group) %>%
     summarise(n = n())
 })) %>% unnest(dd) 
output1
  
### output2 returns the correct n
output2 <-  temp %>% mutate(dd = map(data, function(X) {
    X %>% 
     count(group)
  })) %>% unnest(dd)  


### output3  returns the correct n
ffs <- function(X){
    X %>% 
      group_by(group) %>%
      summarise(n = n()) 
  }
output3 <- temp %>% mutate(dd = map(data, ffs)) %>% unnest(dd)

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 5, 2018

I'll add this version to the mix to create the Magritte lambda outside of the mutate:

nrows_magrittr_lambda <- . %>% summarise(n = n()) %>% .[["n"]]

I've added the hybrid label.

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

df <- tibble(x = list(
  tibble(y = 1:2),
  tibble(y = 1:3),
  tibble(y = 1:4)
))

nrows <- function(df) {
  df %>% summarise(n = n()) %>% .[["n"]]
}

nrows_magrittr_lambda <- . %>% summarise(n = n()) %>% .[["n"]]

trace( dplyr:::mutate.tbl_df, tracer = quote(print(dots)), at = 3 )
#> Tracing function "mutate.tbl_df" in package "dplyr
#> (not-exported)"
#> [1] "mutate.tbl_df"
mutate( df,
        n1 = x %>% map_int(nrows),
        n5 = map_int(x, nrows_magrittr_lambda),

        n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
        n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
        n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
)
#> Tracing mutate.tbl_df(df, n1 = x %>% map_int(nrows), n5 = map_int(x,  .... step 3 
#> $n1
#> <quosure>
#>   expr: ^x %>% map_int(nrows)
#>   env:  global
#> 
#> $n5
#> <quosure>
#>   expr: ^map_int(x, nrows_magrittr_lambda)
#>   env:  global
#> 
#> $n2
#> <quosure>
#>   expr: ^x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
#>   env:  global
#> 
#> $n3
#> <quosure>
#>   expr: ^map_int(x, ~summarise(., n = n())[["n"]])
#>   env:  global
#> 
#> $n4
#> <quosure>
#>   expr: ^map_int(x, function(df) summarise(df, n = n())[["n"]])
#>   env:  global
#> # A tibble: 3 x 6
#>   x                   n1    n5    n2    n3    n4
#>   <list>           <int> <int> <int> <int> <int>
#> 1 <tibble [2 × 1]>     2     2     3     3     3
#> 2 <tibble [3 × 1]>     3     3     3     3     3
#> 3 <tibble [4 × 1]>     4     4     3     3     3

Created on 2018-03-05 by the reprex package (v0.2.0).

@krlmlr krlmlr changed the title dplyr function scoping not as expected within map() Using n() in nested mutate()/summarize() calls uses value for outer data frame Mar 14, 2018
@krlmlr krlmlr changed the title Using n() in nested mutate()/summarize() calls uses value for outer data frame Using n() in nested mutate()/summarize() calls gives unexpected results Mar 14, 2018
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 14, 2018

I finally got it: The n() is the problem here, when used with nested data manipulation verbs.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 26, 2018

Indeed. hybrid simplification is too eager. With this debugging:

Call GroupedHybridCall::simplify(const SlicingIndex& indices) const {
  set_indices(indices);
  Call call = clone(original_call);
  Rprintf( "call: ") ; Rf_PrintValue(call) ;
  while (simplified(call)) {}
  clear_indices();
  Rprintf( "simplified call: " ) ; Rf_PrintValue(call) ;
  Rprintf("--------\n");

  return call;
}

I get:

> library(dplyr)
> mutate( df,
+         n1 = x %>% map_int(nrows),
+         n5 = map_int(x, nrows_magrittr_lambda),
+ 
+         n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
+         n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
+         n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
+ )
call: x %>% map_int(nrows)
simplified call: x %>% map_int(nrows)
--------
call: map_int(x, nrows_magrittr_lambda)
simplified call: map_int(x, nrows_magrittr_lambda)
--------
call: x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
simplified call: x %>% map_int(. %>% summarise(n = 3L) %>% .[["n"]])
--------
call: map_int(x, ~summarise(., n = n())[["n"]])
simplified call: map_int(x, ~summarise(., n = 3L)[["n"]])
--------
call: map_int(x, function(df) summarise(df, n = n())[["n"]])
simplified call: map_int(x, function(df) summarise(df, n = 3L)[["n"]])
--------
# A tibble: 3 x 6
  x                   n1    n5    n2    n3    n4
  <list>           <int> <int> <int> <int> <int>
1 <tibble [2 × 1]>     2     2     3     3     3
2 <tibble [3 × 1]>     3     3     3     3     3
3 <tibble [4 × 1]>     4     4     3     3     3

@lock
Copy link

@lock lock bot commented Mar 13, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Mar 13, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug
Projects
None yet
Development

No branches or pull requests

5 participants