Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter() and conditions of different length #2605

Closed
krlmlr opened this issue Mar 31, 2017 · 21 comments
Closed

filter() and conditions of different length #2605

krlmlr opened this issue Mar 31, 2017 · 21 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@krlmlr
Copy link
Member

krlmlr commented Mar 31, 2017

filter() currently works by appending the conditions to each other using &. This is unsafe: In the following example I'd rather see an error than any result:

library(dplyr, warn.conflicts = FALSE)
iris %>% filter(Sepal.Length > 6, c(TRUE, FALSE), c(FALSE, FALSE, TRUE))
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1           7.0         3.2          4.7         1.4 versicolor
#> 2           6.3         3.3          4.7         1.6 versicolor
#> 3           6.2         2.2          4.5         1.5 versicolor
#> 4           6.4         2.9          4.3         1.3 versicolor
#> 5           6.7         3.1          4.7         1.5 versicolor
#> 6           6.5         3.0          5.8         2.2  virginica
#> 7           6.5         3.2          5.1         2.0  virginica
#> 8           6.5         3.0          5.5         1.8  virginica
#> 9           7.7         2.8          6.7         2.0  virginica
#> 10          6.4         2.8          5.6         2.1  virginica
#> 11          6.1         2.6          5.6         1.4  virginica
#> 12          6.7         3.1          5.6         2.4  virginica
#> 13          6.3         2.5          5.0         1.9  virginica

Fixing this also allows reporting error positions for offending expressions, look for expect_error in test-filter.r .

@krlmlr krlmlr added bug an unexpected problem or unintended behavior data frame labels Mar 31, 2017
@hadley
Copy link
Member

hadley commented Mar 31, 2017

Yeah, and it's surprising given that the first entry is checked.

@krlmlr
Copy link
Member Author

krlmlr commented Mar 31, 2017

Well, base & recycles, and filter() checks only the result.

@krlmlr krlmlr modified the milestone: 0.7.3 Aug 16, 2017
@krlmlr
Copy link
Member Author

krlmlr commented Aug 23, 2017

Postponed, because it's better to include this in a bigger release IMO.

@romainfrancois
Copy link
Member

romainfrancois commented Mar 26, 2018

This is the same as

> iris %>% 
+   mutate( cond = Sepal.Length > 6 & c(TRUE, FALSE) & c(FALSE, FALSE, TRUE)) %>%
+   filter(cond)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species cond
1           7.0         3.2          4.7         1.4 versicolor TRUE
2           6.3         3.3          4.7         1.6 versicolor TRUE
3           6.2         2.2          4.5         1.5 versicolor TRUE
4           6.4         2.9          4.3         1.3 versicolor TRUE
5           6.7         3.1          4.7         1.5 versicolor TRUE
6           6.5         3.0          5.8         2.2  virginica TRUE
7           6.5         3.2          5.1         2.0  virginica TRUE
8           6.5         3.0          5.5         1.8  virginica TRUE
9           7.7         2.8          6.7         2.0  virginica TRUE
10          6.4         2.8          5.6         2.1  virginica TRUE
11          6.1         2.6          5.6         1.4  virginica TRUE
12          6.7         3.1          5.6         2.4  virginica TRUE
13          6.3         2.5          5.0         1.9  virginica TRUE

@romainfrancois
Copy link
Member

This happens on the R side with all_exprs :

#' @export
filter.tbl_df <- function(.data, ...) {
  dots <- quos(...)
  if (any(have_name(dots))) {
    bad <- dots[have_name(dots)]
    bad_eq_ops(bad, "must not be named, do you need `==`?")
  } else if (is_empty(dots)) {
    return(.data)
  }

  quo <- all_exprs(!!!dots, .vectorised = TRUE)
  filter_impl(.data, quo)
}

Too late to do anything about this in the C++ side, because the expression is legit then.

@krlmlr
Copy link
Member Author

krlmlr commented Mar 26, 2018

We need to change the R -> C++ interface to fix this.

@romainfrancois
Copy link
Member

> trace( dplyr:::filter.tbl_df, at = 5, tracer = quote(print(quo))) 
Tracing function "filter.tbl_df" in package "dplyr (not-exported)"
[1] "filter.tbl_df"
> iris %>% filter(Sepal.Length > 6, c(TRUE, FALSE), c(FALSE, FALSE, TRUE))
Tracing filter.tbl_df(tbl_df(.data), ...) step 5 
<quosure>
  expr: ^(^Sepal.Length > 6) & (^c(TRUE, FALSE)) & (^c(FALSE, FALSE, TRUE))
  env:  base
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           7.0         3.2          4.7         1.4 versicolor
2           6.3         3.3          4.7         1.6 versicolor
3           6.2         2.2          4.5         1.5 versicolor
4           6.4         2.9          4.3         1.3 versicolor
5           6.7         3.1          4.7         1.5 versicolor
6           6.5         3.0          5.8         2.2  virginica
7           6.5         3.2          5.1         2.0  virginica
8           6.5         3.0          5.5         1.8  virginica
9           7.7         2.8          6.7         2.0  virginica
10          6.4         2.8          5.6         2.1  virginica
11          6.1         2.6          5.6         1.4  virginica
12          6.7         3.1          5.6         2.4  virginica
13          6.3         2.5          5.0         1.9  virginica

@romainfrancois
Copy link
Member

iirc expressions used to be passed down to the C++ side, but the code was consequently more complicated.

@krlmlr
Copy link
Member Author

krlmlr commented Mar 26, 2018

I'd suggest to shelve this until after #2311.

@romainfrancois
Copy link
Member

#2311 is done as far as filter is concerned. should we move back to passing quosures down instead of a single one created by all_exprs ?

@krlmlr
Copy link
Member Author

krlmlr commented May 30, 2018

I think that's the best way to proceed: pass down quosures and evaluate them one by one.

There's room for optimization: If the first predicate is TRUE only for a low percentage (5%? 20%? 50%?) of rows, it may be faster to subset and compute the remaining predicates on the subset. But this feels like a deep rabbit hole...

@romainfrancois
Copy link
Member

You can't really do that without analyzing the expressions and this is dangerous. When I'm done with #3526 perhaps we'll have new hybrid handler for things we typically see in filter, x == 1, ...

@krlmlr
Copy link
Member Author

krlmlr commented May 30, 2018

I just meant looking at the value after computing the first predicate, not looking at expressions. Why would you want to analyze the expressions?

@romainfrancois
Copy link
Member

The only thing that could be interesting is if all the values are FALSE

@krlmlr
Copy link
Member Author

krlmlr commented Jun 4, 2018

As discussed:

tbl %>%
  filter(pred1(...), lag(col1) < col1)

and

tbl %>%
  filter(pred1(...)) %>%
  filter(lag(col1) < col1)

are not equivalent.

@krlmlr krlmlr modified the milestones: 0.7.3, 0.8.0 Jun 4, 2018
@romainfrancois romainfrancois self-assigned this Oct 8, 2018
@hadley
Copy link
Member

hadley commented Oct 8, 2018

Let's shelve this for now — I think this is probably best seen a special case of as applying the tidyverse recycling rules to all binary operators. Because if we implement with for filter(df, x, y) we really want to be consistent with filter(df, x & y) and then |, and ==, and +, and ...

And then if we apply the tidyverse recycling rules, should we also apply the new tidyverse coercion rules?

@hadley hadley removed this from the 0.8.0 milestone Oct 8, 2018
@krlmlr
Copy link
Member Author

krlmlr commented Nov 7, 2018

Would it be too pretentious to override &?

`&` <- function(e1, e2) {
  if (length(e1) != length(e2)) {
    if (length(e1) != 1 || length(e2) != 1) stop("Recycling")
  }
  base::`&`(e1, e2)
}


TRUE & FALSE
#> [1] FALSE

c(TRUE, FALSE) & c(FALSE, TRUE, FALSE)
#> Error in c(TRUE, FALSE) & c(FALSE, TRUE, FALSE): Recycling

Created on 2018-11-07 by the reprex package (v0.2.1.9000)

@krlmlr
Copy link
Member Author

krlmlr commented Nov 7, 2018

Or can we implement something like Ops.vctr?

@romainfrancois
Copy link
Member

We have yet to formalize what "tidy recycling" is, it probably would have to depend on vctrs::vec_size rather than length

@hadley
Copy link
Member

hadley commented Nov 7, 2018

We should tackle at the same time as #3937

@romainfrancois romainfrancois added this to the 0.9.0 milestone Dec 18, 2018
@hadley
Copy link
Member

hadley commented Dec 10, 2019

Now tracking at tidyverse/funs#33 — my sense is that it would be better to have some way to activate this globally so you don't end up with different results inside and outside of dplyr functions.

@hadley hadley closed this as completed Dec 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants