Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select helpers dropping all before FALSE entry #2275

Closed
r2evans opened this issue Nov 30, 2016 · 18 comments
Closed

select helpers dropping all before FALSE entry #2275

r2evans opened this issue Nov 30, 2016 · 18 comments
Labels

Comments

@r2evans
Copy link
Contributor

@r2evans r2evans commented Nov 30, 2016

The "select helper" functions appear to be a bit too aggressive about excluding everything before it. This may be a documentation thing, but I expect positive helper functions to be additive (which appears to be the case), and "none-found" helper functions to not be subtractive.

Note: in all examples, I'm using summarize_at (and mean) primarily for terse output. The behavior is the same with other helpers (e.g., matches, starts_with) and within other verbs (e.g., select, mutate_at).

Currently-Correct Behavior

For the record, "all matches found" works fine, as does a not-found match when it is the first one:

### expecting am, gear, vs
summarize_at(mtcars, vars(contains("am"), contains("gear"), contains("vs")), funs(mean))
#        am   gear     vs
# 1 0.40625 3.6875 0.4375

### expecting gear, vs
summarize_at(mtcars, vars(contains("FOO"), contains("gear"), contains("vs")), funs(mean))
#     gear     vs
# 1 3.6875 0.4375

Expected Behavior

I have been interpreting select(mtcars, contains("am"), contains("foo")) to return all columns that contain "am" or "foo", regardless of the order I provide the calls to contains (and other helpers).

Errant Behavior

Everything before a failing conditional is dropped, regardless of match:

### expecting am, gear
summarize_at(mtcars, vars(contains("am"), contains("gear"), contains("FOO")), funs(mean))
# data frame with 0 columns and 0 rows

### expecting am, vs
summarize_at(mtcars, vars(contains("am"), contains("FOO"), contains("vs")), funs(mean))
#       vs
# 1 0.4375

This appears to be because the helper functions, when nothing is found, return the negative vector of column indices:

contains("a", vars = colnames(mtcars))
# [1]  5  9 10 11
contains("FOO", vars = colnames(mtcars))
#  [1]  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11
@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Nov 30, 2016

As I keep thinking about this, I can side-step this by using something like matches("am|foo|vs"), is that the intent? If these helpers are not intended to be chained like this, then I suggest/request specific behavior based on "remove anything that matches this", "remove anything that doesn't match this", and "do something specific when nothing is found".

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Nov 30, 2016

Looking at the history, #1176 might be related. Supporting both use cases looks tricky.

@krlmlr krlmlr added the bug label Nov 30, 2016
@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Nov 30, 2016

Thanks, I was confident there would be something related, but I didn't find that one.

BLUF: I think contains() should be returning integer(0) instead of -seq_along(vars), and the responsibility for dealing with an empty match lies with the verb function instead of the helper.

The current behavior is currently inconsistent between named (and searched) columns versus explicit integers. For example:

df <- data.frame(a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8)

### these two are logically the same with same results
select(df, integer(0), 7)
#   g
# 1 7
select(df, matches('i'), matches('g'))
#   g
# 1 7

### these two are logically the same with different results
select(df, matches('g'), matches('i'))
# data frame with 0 columns and 1 row
select(df, 7, integer(0))
#   g
# 1 7

The original problem (#1176) appears to be not with the return value of an individual select-helper function, but with the base-case that the enclosing verb assumes. For instance, the argument posed in #1176 is that it's the responsibility of contains to decide to return -seq_along(vars), whereas it's actually the enclosing verb function that should be deciding to do or not do that.

When contains is complete, it is possibly only part-way through the selection criteria, so it cannot know the proper way to proceed. In select(df, contains("foo")), contains should succinctly return exactly the indices that match; in this case, it's "none". It doesn't need to know how it interacts with other selectors (integer, explicit, or functional), and shouldn't need to care. When select() combines the results of all of its selectors, only then should a base-case be considered and applied.

I believe the verb function should define the base case based on the first selector; that is, if it is an "inclusive" selector, then the base-case is integer(0) (nothing selected). If the first selector is an "exclusive" selector, then the base-case is seq_along(vars). (This can be explicitly overridden by the user with select(df, everything(), ...) or -everything().)

So, I think the previous behavior of contains (returning integer(0) if empty) is correct, and the enclosing function should be extended to deal with the base-case. It's actually already doing it correctly with integer arguments; it becomes inconsistent when incorporating non-integer selectors:

select(df, -(1:8))
# data frame with 0 columns and 1 row

### correct
select(df, -contains('a'))
#   b c d e f g h
# 1 2 3 4 5 6 7 8
select(df, -contains("foo"))
#   a b c d e f g h
# 1 1 2 3 4 5 6 7 8

### incorrect, it "undoes" the remove-filtering of the first selector
select(df, -contains('a'), -contains("foo"))
#   b c d e f g h a
# 1 2 3 4 5 6 7 8 1

### correct
select(df, -(1:8), contains("foo"))
# data frame with 0 columns and 1 row
### incorrect and ODD (new bug?)
select(df, -(1:8), -contains("foo"))
#    a  b  c  d  e  f  g  h
# 1 -1 -2 -3 -4 -5 -6 -7 -8

I think this means that fill_out should be removed from each of the helper functions, and its logic added to the verbs.

The documentation from that patch won't necessarily change:

In select(), negating a failed match (e.g. select(mtcars, -contains("x"))) returns all columns, instead of no columns

because select will use the base-case of everything() (due to the exclusive first-selector), remove nothing, and end up with everything still.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 1, 2016

@hadley: Please confirm that

summarize_at(mtcars, vars(contains("am"), contains("FOO"), contains("vs")), funs(mean))

should return both am and vs columns. Currently it returns only vs, I believe this might have worked in 0.4.3 and is broken since #1176.

@hadley
Copy link
Member

@hadley hadley commented Dec 1, 2016

Yes, should contain am and vs

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 1, 2016

@r2evans: Would you like to contribute a PR?

@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Dec 1, 2016

I was going to play with it, I'll let you know if I can make something sooner rather than later.

@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Dec 1, 2016

I know I'll be modifying the select_helper functions. I'm not intimately familiar with the package, so can either of you give quick thoughts on these questions to make sure I'm not missing a major component or edge-case?

  • I will be looking at colwise_() and select_vars_() as prime targets in the patch. I don't think verb functions (e.g., select(), mutate_at()) will need to be modified at all. Can you think (off-hand) of other functions to check?
  • How is it best to determine if the first selector is inclusive/exclusive? I think I can use the return of vars to find if the first "selector" is inclusive/exclusive, though I suspect that "-" == substr(as.character(vars(-matches("am"))[[1]])[[1]], 1, 1) is a bit brute-force (and perhaps not robust).

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 1, 2016

May I suggest you start by writing tests that define the desired behavior? If the tests are currently failing, you can still submit a pull request with a skip("Currently failing") call.

I'm asking this because I'm not sure why the first selector is special. We should therefore agree on the desired behavior first.

@hadley
Copy link
Member

@hadley hadley commented Dec 1, 2016

The first selector is special because it determines the default inclusion list - if it's postive, you start with an empty vector of character names, if it's negative, you need to start with all existing names.

I doubt the bug is in the individual selection helpers — it's more likely to be in the code that combines them all together.

@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Dec 1, 2016

Up-front, if any selectors make a match, then no base-case is required/considered. This "base-case" discussion is only relevant when all selectors find nothing and return integer(0); fill_out() is ensuring that this cannot happen (and is the source of the problem, I think).

First Two Options

Giving it more thought, I can think of two ways to determine what the base-case should be. We can determine it on:

  1. the "first selector" (as mentioned); or
  2. the presence of "any inclusive selector". That is, if all selectors are exclusive then we start with all column names, else (at least one is inclusive) we start with none.

The only time I can see (1) the first selector logic failing is in the use of select_() where the order of .dots is not guaranteed. I have not found a time when (2) any inclusive selector logic would fail, so perhaps it's better (albeit requiring checking the polarity of all selectors vice just the first).

@hadley, your preference?

I doubt the bug is in the individual selection helpers — it's more likely to be in the code that combines them all together.

I agree, though I suspect I can remove fill_out completely, impacting all helpers.

Third Option

Have you ever considered using grepl and logical vectors instead of integer column indices? It would likely mitigate this problem for everything (except a 0-column data.frame).

#' @export
#' @rdname select_helpers
everything <- function(vars = current_vars()) {
  rep(TRUE, times = length(vars))
}

match_vars <- function(needle, haystack) {
  x <- match(needle, haystack)
  x <- x[!is.na(x)]
  seq_along(haystack) %in% x
}

grep_vars <- function(needle, haystack, ...) {
  grepl(needle, haystack, ...)
}

which_vars <- function(needle, haystack) {
  haystack %in% needle
}

This would likely require other changes, but it "should never" be empty, mitigating the need for fill_out and the two bugs. It is perhaps a bit more of a sweeping change, so I'll move forward with one of the first two for now.

@hadley
Copy link
Member

@hadley hadley commented Dec 1, 2016

I'm pretty the logic should be keyed on the first, because you might do include, exclude, include or exclude, include, exclude.

I'm pretty sure the first implementation of this used logical vectors (i.e. boolean algebra instead of set membership). I can't remember why I changed, but it might've been a good reason.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 1, 2016

How does negation work with the selectors? Do we need to use to ! instead of - if we switch to logical vectors?

@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Dec 1, 2016

That would have made sense to me, but I hadn't put that much thought into it yet. If either of you think it's worth investigating, it might be better to make that a separate issue (I'm just thinking that it would touch several more functions, requiring a little more concerted effort.)

@hadley
Copy link
Member

@hadley hadley commented Dec 1, 2016

If we did switch to logical vectors, it would need to be done transparently, so that - continued to work. (This would be matter of evaluating the selection functions in a special environment where - was mapped to !, or doing other computing on the language.)

@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Dec 5, 2016

PR #2289 submitted

@r2evans
Copy link
Contributor Author

@r2evans r2evans commented Dec 5, 2016

I opted for testing all combinations of "single vs multiple", "explicit vs named", and "present vs not-present", and I hit an interesting one:

cn <- setNames(colnames(mtcars), nm = colnames(mtcars))
select_vars(cn, -contains("vs"), contains("mpg"))
#    mpg    cyl   disp     hp   drat     wt   qsec     am   gear   carb 
#  "mpg"  "cyl" "disp"   "hp" "drat"   "wt" "qsec"   "am" "gear" "carb" 

By our "first selector" logic, we should start all "all except 'vs'", to which contains("mpg") will add nothing. @hadley and @krlmlr can you confirm that this is the intended behavior?

@krlmlr krlmlr closed this in #2289 Jan 26, 2017
krlmlr added a commit that referenced this issue Jan 26, 2017
…2289)

* change initial columns selected based on first selector (fix #2264)

* increased permutations of first-selector tests

- single vs multiple selectors
- first selector positive versus negative
- explicit versus matching
- present versus not-present

* added is_negated (for use within select_vars_)

* changed setNames and expect_named calls

* changed expect_named calls

* changed all use of mtcars to 3-column data.frame
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 26, 2017

@r2evans: The behavior you reported above looks reasonable to me.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants