New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Equivalent of SQL CASE #631

Closed
edwindj opened this Issue Sep 26, 2014 · 35 comments

Comments

Projects
None yet
@edwindj
Contributor

edwindj commented Sep 26, 2014

In data cleaning data often contains values that one would like to impute/replace.

For example:

In a survey for smokers, one would not ask a person his or her cigarette consumption when he/she has indicated not to be a smoker. So technically the number of cigarretes is not measured, but it can be deduced to be zero.

smoke <- read.csv(textConnection(
"name,is_smoker,cigarrettes,cigares
 John,TRUE,15, 0
 Adam,FALSE,NA,NA
 Phil,TRUE,0,1
"
))

#standard R way
smoke$cigarrettes[!smoke$is_smoker & is.na(smoke$cigarrettes)] <- 0

How does one do this elegantly in dplyr?

My 2 cents:

#dplyr option 1: split the data set, fix one part and combine the two
smoker_true <- smoke %>% filter(!is_smoker, is.na(smoke$cigarettes))
# complement
smoker_false <- smoke %>% filter(is_smoker | !is.na(smoke$cigarettes))

smoke2 <- rbind( smoker_true %>% mutate(cigarrettes=0)
               , smoker_false
)

which I don't find elegant dplyr code.

So try a little harder

# dplyr option 2: replace the whole variable and do the filtering in the assignment
smoke2 <- 
  smoke %>%
  mutate(cigarrettes=ifelse(!is_smoker & is.na(cigarrettes), 0, cigarrettes))

It works, but this is not efficient/workable for the database backends and I find it unelegant to
do the filtering in the assignment.

I suggest a new method:
mutate_where that allows for changing values of existing variables of a subset of a tbl.
So something like

smoke2 <-
   smoke %>% mutate_where(.where=!is_smoker&is.na(cigarrettes), cigarettes=0)

Or another option (in analogy with group and grouped_<tbl>) to change filter or introduce filtered_<tbl>/add_filter: setting a filter lets one work on a subset of the data, but one can remove_filter.

smoke2 <-
   smoke %>% 
   add_filter(is_smoker, is.na(cigarrettes) %>%
   mutate(cigarrettes=0) %>%
   remove_filter()

Any better ideas or suggestions?

@edwindj edwindj changed the title from Conditional mutate: mutate_where? to Conditional mutate: mutate_where? Or add_filter/remove_filter? Sep 26, 2014

@hadley

This comment has been minimized.

Member

hadley commented Sep 26, 2014

I really prefer the vector based solution, but it would be much simpler if ifelse had a branch for NA:

smoke2 <- 
  smoke %>%
  mutate(cigarrettes = ifelse(!is_smoker, 0, cigarrettes, 0)
)

The problem is that I don't see how any other approach could work with databases - you can not assume that you have write access to the underlying table.

@edwindj

This comment has been minimized.

Contributor

edwindj commented Sep 26, 2014

Agreed :-)

While I think it is possible to implement it in DBs using temporary tables and multiple SQL statements, I think it is wise not to do so: it will complicate the translate_sql code and furthermore it is rather DB specific how TEMP tables work.

@hadley

This comment has been minimized.

Member

hadley commented Sep 26, 2014

I'd like to approach this problem by figuring out what the 90% of most common in-place mutates do, and then figuring out how to make elegant vectorised replacement functions.

@hadley hadley added this to the 0.4 milestone Sep 30, 2014

@hadley hadley added the feature label Sep 30, 2014

@hadley hadley self-assigned this Sep 30, 2014

@hadley

This comment has been minimized.

Member

hadley commented Oct 1, 2014

I'm thinking of adapting something like derivedFactor() from mosaic: https://github.com/rpruim/mosaic/blob/master/R/derivedFactor.R - not quite sure how it should look for this case yet.

@rpruim any thoughts?

@rpruim

This comment has been minimized.

rpruim commented Oct 2, 2014

I've been thinking about this today and have an idea to float. We could use formulas of the form

expression ~ condition     # use expression when condition is TRUE
expression ~ .             # use expression when no other conditions are TRUE (i.e., default)

to specify the desired variable/vector. This is modeled on mathematical definitions by cases, so below I'll call the function cases(). The smoking example would turn into

smoke %>% mutate( 
  cigs2 = cases( 
    0 ~ ! is_smoker & is.na(cigarrettes), 
    cigarettes ~ .)
  )

And fairly complicated definitions by cases could be constructed similarly.

As with derivedFactor(), we could have options to take the first true condition, the last true condition, or to require that exactly one condition be true. This could subsume derivedFactor if there were an argument that declared the result to be a factor and used expression as labels.

Using formulas makes it easy to have arguments with natural names (rather than prepending with . or .. or _ to avoid colliding with names in a data frame). The formulas could all be part of ... and named arguments could control the overall behavior.

cases <- function( ..., match=c("unique", "first", "last"), factor=FALSE, ordered=FALSE) {
}

What do you think?

@kevinykuo

This comment has been minimized.

kevinykuo commented Oct 2, 2014

Haven't given any thought on the viability for implementation, but here's an idea, inspired by @rpruim and pattern matching:

smoke %>% mutate( 
  cigs2 = cigarettes %>% match_with(
     NA ~ 0,
     . ~ cigarettes)
  )
@rpruim

This comment has been minimized.

rpruim commented Oct 2, 2014

Hmm. Note sure you really want this. Rather, I think its use case would be too restricted.

In your case, you only work with one variable, but in general both the conditions and the expressions might refer to multiple variables. Remember that %>% is just an abbreviation, so you are really proposing

cigs 2 = match_with(cigarettes, NA ~ 0, . ~ cigarettes)

What I'm proposing would work on vectors (which could be evaluated in a data frame or elsewhere) and would make it simple to create either expressions or conditions from multiple vectors. mutate() would take care of evaluating them in the object that is its first argument.

I also think I prefer my original order for the formulas. The expression is more like a response and the condition more like an explanatory variable, so it seems to fit the model of other uses of formulas better.

@hadley

This comment has been minimized.

Member

hadley commented Oct 2, 2014

I think I'd prefer having the result on the RHS:

cases( 
    !is_smoker & is.na(cigarettes) ~ 0,
    TRUE ~ cigarettes)
  )

Using TRUE as a catch all for the default seems reasonable. Or you could use a one-sided formula: ~ cigarettes). I like the use of the formula to do explicit quoting. It also means that you'll be able to program withcases()` because it wouldn't need to use NSE.

Taking an example from `derivedFactor():

cases(
  i1 == 0 ~ "abstinent",
  (i1>0 & i1<=1 & i2<=3 & sex=='female') | (i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate"
  ((i1>1 | i2>3) & sex=='female') | ((i1>2 | i2>4) & sex=='male') = "highrisk"
)

# Or by moving our the "or"s
cases(
  i1 == 0 ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
  (i1>1 | i2>3) & sex=='female') ~ "highrisk"
  (i1>2 | i2>4) & sex=='male') ~ "highrisk"
)

# You could also nest cases to make the male/female distinction clearer
cases(
  i1 == 0 ~ "abstinent",
  sex == 'female' ~ cases(
    i1>0 & i1<=1 & i2<=3 ~ "moderate",
    i1>1 | i2>3  ~ "highrisk"
  ),
  sex == 'male' ~ cases(
    i1>0 & i1<=2 & i2<=4 ~ "moderate",
    i1>2 | i2>4 ~ "highrisk"
  )
)

# Or with a default
cases(
  i1 == 0 ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
  TRUE ~ "highrisk"
)
@matthieugomez

This comment has been minimized.

matthieugomez commented Oct 2, 2014

Creating a new variable (as in hadley's last post) is conceptually quite different from modifying some value of a variable (as in the OP's post) - for instance, default behaviors for what happens with non matched rows should differ (resp NA vs original value). There could two different verbs instead of cases depending on the situation, similarly to Stata recode vs replace

@hadley

This comment has been minimized.

Member

hadley commented Oct 2, 2014

@matthieugomez yes, agreed. In SQL there are two forms of CASE: one works on the values of a single variable, the other takes a list of conditionals (analogous to the case function proposed here).

But the challenge is NA: the usual semantics for equality won't work here. (And there's a general problem for floating point values).

@rpruim

This comment has been minimized.

rpruim commented Oct 2, 2014

One advantage of the expression ~ condition syntax (besides mirroring the way cases are presented notationally in books) is that often the expressions are simpler and more similar to each other, so the code looks more elegant. Also, knowing what the expression is often makes it clearer why the condition is defined the way it is. But the expressions get a bit buried in the condition ~ expression order.

cases(
  "abstitnent" ~ i1 == 0,
  "moderate"   ~ i1>0 & i1<=1 & i2<=3 & sex=='female'),
  "moderate    ~ i1>0 & i1<=2 & i2<=4 & sex=='male'),
  "highrisk"   ~ (i1>1 | i2>3) & sex=='female'),
  "highrisk"   ~ (i1>2 | i2>4) & sex=='male')
)

Whichever way the formulas are used, I like the nesting idea.

@hadley

This comment has been minimized.

Member

hadley commented Oct 2, 2014

@rpruim That is true, but I think it makes nesting less natural, and you can achieve the nice alignment with lots of whitespace

cases(
  i1 == 0                              ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female' ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male'   ~ "moderate",
  (i1>1 | i2>3) & sex=='female'        ~ "highrisk"
  (i1>2 | i2>4) & sex=='male'          ~ "highrisk"
)

@matthieugomez I don't see the advantage of that over the filter based approach. Sure it save a little memory, but at the expense of making the API non-orthogonal. The goal of dplyr isn't to squeeze the last bit of performance out of R, it's to have provide a elegant and efficient API for data manipulation.

@rpruim

This comment has been minimized.

rpruim commented Oct 2, 2014

Hmm... The nesting argument is probably reason enough to prefer condition ~ expression because in that case, the condition becomes simpler and the expression becomes more involved.

@hadley

This comment has been minimized.

Member

hadley commented Oct 2, 2014

To make the semantics clear, I think each vector referenced in cases() would need to be either a scalar, or of the same length. Doing recycling in this scenario is going to be quite complex.

I'm also reasonably certain this will be straightforward to translate to SQL.

@hadley

This comment has been minimized.

Member

hadley commented Oct 22, 2014

Here's a start on an implementation: https://gist.github.com/hadley/2751ba61d1c7f4eaacab

Currently it requires you pass the data frame as the first argument, but I think I can work around that. It's not very efficient, but it's a starting point.

@holstius

This comment has been minimized.

holstius commented Aug 5, 2015

For what it's worth: here's a patch verb that I'm starting to use in my own work:

https://gist.github.com/holstius/cbc4ec40057fbc2d9f4b

In my workflows, I like to separate "patching" from "recoding" or "labeling". Patching is the real-world consequence of deadlines + busy upstream providers — it's not supposed to be there, but it is.

@hadley hadley changed the title from Vectorised switch() to Equivalent of SQL CASE Mar 14, 2016

@hadley

This comment has been minimized.

Member

hadley commented Mar 14, 2016

With the new if_else and coalesce functions, you could now write the original motivating example as

mutate(smoke, cigarettes = if_else(coalesce(is_smoker, FALSE), cigarettes, 0L))

And note that this discussions is really about a sql CASE equivalent for R, not a vectorised switch. The discussion for vectorised switch is now taking place in #1710.

I've also deleted a number of comments in this thread that are no longer pertinent.

@hadley

This comment has been minimized.

Member

hadley commented Mar 14, 2016

And now you can do:

mutate(smoke, cigarettes = if_else(is_smoker, cigarettes, 0L, 0L))

@hadley hadley closed this in b6b794f Mar 14, 2016

@hadley

This comment has been minimized.

Member

hadley commented Mar 14, 2016

I'm pretty happy with this implementation, but I'd love to get everyone's feedback on the name and how it works.

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 14, 2016

These new functions are awesome!

case_when()

It could be named match_cases() or match_patterns().

I find the tilde a bit hard to parse visually with a long list of cases. Maybe allow the := operator? The only thing to do would be to export :=` <- `~.

match_cases(
  x %% 35 == 0 := "fizz buzz",
  x %% 5 == 0 := "fizz",
  x %% 7 == 0 := "buzz",
  TRUE := as.character(x)
)
recode()

Would it be too unsafe to coerce the default value automatically?

recode(x, 5, 4, 3, default = x)
#> Error: `default` has type 'integer' not 'double'

recode(x, "a", "b", "c", default = x)
#> Error: `default` has type 'integer' not 'character'

recode(x, "a", "b", "c", default = NA)
#> Error: `default` has type 'logical' not 'character'

against the more verbose:

recode(x, 5, 4, 3, default = as.double(x))
recode(x, "a", "b", "c", default = as.character(x))
recode(x, "a", "b", "c", default = NA_character_)
@hadley

This comment has been minimized.

Member

hadley commented Mar 14, 2016

I think I prefer ~ for now - you can always add more spaces around it to make it more obvious.

I'm also in an explicit mood at the moment, so I'd prefer to require the default to be correctly typed. Also the logic starts to get a bit hairy if there are no arguments (just the default).

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 14, 2016

I think recode() will be very hard to use with numeric data. It would be great to be able to do:

recode(x, `1` = 5, `2` = 4)

This would ideally work with both integers and integer-like doubles (as often integer data is stored in a double).

@gergness

This comment has been minimized.

Contributor

gergness commented Mar 14, 2016

Agreed, looks great!

What about using the formula notation from case_when for recode? This would help with lionel's point about numeric data, and also could prevent name conflicts for x, default and missing.

x <- c("x", "y", "z")
recode(x, x = "a", y = "b", z = "c")
#>Error: All replacements must be named

hadley added a commit that referenced this issue Mar 15, 2016

Also use names for numeric vars. #631
@lionel- if this is going to work for doubles, it also needs a switch from == to near(). But that seems a bit perilous to me

hadley added a commit that referenced this issue Mar 15, 2016

@kevinushey

This comment has been minimized.

Contributor

kevinushey commented Mar 15, 2016

I also don't like using named arguments for recode() as the names are forced to be / become symbols, whereas ~ or := preserve the LHS type. And recode(x, 1 = 2) is pretty ugly.

@hadley

This comment has been minimized.

Member

hadley commented Mar 15, 2016

@kevinushey yeah, but I don't see that being a particularly common use case. And I think using ~ means that we'll need to do some work to make case_when() work with the hybrid evaluator.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 16, 2016

In 4920c8b, I need to use .$ to access columns when using case_when() in mutate() -- is this intended?

> mtcars %>% tbl_df %>%  mutate(size = case_when(cyl > 6 ~ "big", TRUE ~ "small"))
Error: object 'cyl' not found
> mtcars %>% tbl_df %>%  transmute(size = case_when(cyl > 6 ~ "big", TRUE ~ "small"))
Error: object 'cyl' not found
> mtcars %>% tbl_df %>%  transmute(size = case_when(.$cyl > 6 ~ "big", TRUE ~ "small"))
Source: local data frame [32 x 1]

    size
   (chr)
1  small
2  small
3  small
4  small
5    big
6  small
7    big
8  small
9  small
10 small
..   ...

(Is this meant by "some work with the hybrid evaluator"?)

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 16, 2016

Re formula notation for recode(): It's also a question of API consistency.

@hadley

This comment has been minimized.

Member

hadley commented Mar 16, 2016

@krlmlr see #1719

I think the point of if_else, recode and case_when is not to have exactly equivalent interfaces but to solve a series of increasingly more complicated/general situations that each require progressively more syntax.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 16, 2016

I agree to "more syntax", but to me this doesn't necessary imply that each syntax should be different.

A different implementation of case_when(), that is also consistent with if_else() and perhaps doesn't suffer from #1719, would be

case_when <- function(condition, result, ...) {
  if (missing(condition)) NULL
  else
    if_else(condition, result, case_when(...))
}

(replace the if_else() call with its body).

@sfirke

This comment has been minimized.

Contributor

sfirke commented Apr 20, 2016

re: @krlmlr at #631 (comment), is .$ indeed the intended syntax? I ask because I was going to suggest adding an example of case_when with mutate to the documentation. Use with mutate will be common and some users might need a pointer toward .$. since in their current nested ifelse calls they don't need it.

Example:

mtcars %>% mutate(
  efficiency = case_when(
    .$mpg < 22 ~ "low",
    .$mpg <= 26  ~ "moderate",
    TRUE ~ "high"
  )
)
@hadley

This comment has been minimized.

Member

hadley commented Apr 20, 2016

@sfirke no - the goals is for case_when() to work as is inside mutate().

@bergsmat

This comment has been minimized.

bergsmat commented May 9, 2016

Loving case_when, but encountering results I don't fully understand (dplyr 3074c).

x <- expand.grid(y=5,z = c(T,F))
x <-  group_by(x,z)
mutate(          x, y = case_when(   z ~ n()         )) # (no error)
mutate(          x, y = case_when(   z ~ n(),!z ~ 0L )) # (no error)
mutate( ungroup(x), y = case_when(   z ~ n(),!z ~ 0L )) # object 'z' not found
mutate(          x, y = case_when(   z ~ 1L          )) # object 'z' not found
mutate(          x, y = case_when( .$z ~ 1L          )) # object 'z' not found
@hadley

This comment has been minimized.

Member

hadley commented May 10, 2016

Yes, it doesn't yet work inside mutate.

@mvabl-steve

This comment has been minimized.

mvabl-steve commented Sep 12, 2017

I've been teaching myself R from scratch, basically by undertaking something, then reading posts like these, and trial-and-error based on that. Sometimes I hit a wall and reach out.

I've hit a wall. I have dplyr 0.7 installed. I have a tibble with a column - call it contract_key - I added by applying mutate(coalesce()) to three other columns in the tibble. Here's sample data, which I'll of course refer to later as df:

product,supplier,vendor,quantity,gross,contract_key 168145850127,person1,org1,7,8.14E-04,person1 655393265191,person5,org2,7,0.041861099,person5 655393265191,person3,org3,5,0.019268984,person3 168145850127,person4,org2,4,0.023245343,org2 928893912852,person5,org2,2,0.002285355,person5 NA,person6,org1,3,NA,person6 168145850127,person7,org1,1,NA,person7 350468621217,person8,org2,2,0.004212372,person8 977939797847,person9,org1,1,0.001832018,person9 168145850127,person10,org3,6,0.043907375,168145850127 168145850127,person5,org2,3,0.020960582,168145850127 168145850127,person5,org2,1,0.006923272,person5 350468621217,person11,org1,2,0.002730578,350468621217 350468621217,person5,org2,1,0.005505893,person5 977939797847,person11,org1,18,0.023260808,org1 350468621217,person12,org2,1,0.003525162,org2 168145850127,person13,org4,1,0.001932924,person13 655393265191,person5,org2,7,0.016847417,person5 NA,person14,org5,2,NA,NA 928893912852,person5,org2,2,4.22E-04,person5 168145850127,person5,org2,1,5.82E-04,person5 168145850127,person5,org2,1,0.007305813,168145850127 866586597461,person5,org2,8,0.051104768,866586597461 977939797847,person5,org2,6,0.007783279,person5 866586597461,person5,org2,4,0.0225,866586597461

Here's how I generated contract_key:

df <- df %>% mutate(contract_key = coalesce(product,supplier,vendor))

I now want to add another column that categorizes the contents of contract_key based on which of the three columns provided the content (through coalesce()). So if contract_key ="person5", for example, the new column, contract_level, would be "supplier". And contract_key="org2" would map to contract_level = "vendor", etc.

Essentially, I'll be using contract_level as a join variable to another tibble.

I'm stumped. I've tried if_else, and I see that I shouldn't bother trying case_when (because it's inside mutate()).

It's probably basic R syntax that I just don't know. But if someone supplies the answer, I will backtrace until I figure out what you did. (And I'll have learned a new lesson in R!)

Thanks!

@mvabl-steve

This comment has been minimized.

mvabl-steve commented Sep 12, 2017

thanks @friep. i didn't realize the unwritten rules for SO v git. (Seriously, I didn't. Sincere thanks.)

I'll try SO, and all its attendant downvoting and judging! :)

@lock lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.