morph() to automatically remove columns "used up" by a mutate() #3721

ArtemSokolov · 2018-07-20T17:11:43Z

Dear dplyr developers,

A recent Stack Overflow question raised an interesting use case: having the columns fed to a mutate() call automatically removed from the result. To do this, the mutator would need to parse the input expressions to determine which symbols were used, and I made a first pass at designing such a function. The question author liked my answer and suggested that I contribute it to dplyr.

I am happy to work on a PR with a more robust implementation, but I wanted to check with you if such a feature would align with your design principles and the spirit of the package.

Thanks. Big fan of your work.
-Artem
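
Editor's sketch of the expression-parsing approach described above. This is not the code from the Stack Overflow answer; `morph_sketch()` is a hypothetical name, and the sketch assumes dplyr >= 1.0.0 and rlang are installed. It collects the symbols in each expression with base R's `all.vars()`, keeps those that name existing columns, and drops them after the `mutate()`.

```r
library(dplyr)
library(rlang)

# Hypothetical helper, not part of dplyr: mutate, then drop the input
# columns that the expressions referred to.
morph_sketch <- function(.data, ...) {
  exprs <- enquos(...)
  # all.vars() lists every symbol that appears in an expression.
  used <- unique(unlist(lapply(exprs, function(q) all.vars(quo_get_expr(q)))))
  # Keep only symbols that name existing columns ...
  used <- intersect(used, names(.data))
  # ... and never drop a column that the call itself (re)creates.
  used <- setdiff(used, names(exprs))
  .data %>%
    mutate(!!!exprs) %>%
    select(-all_of(used))
}

out <- morph_sketch(iris, Petal.Area = Petal.Width * Petal.Length)
names(out)
```

Because the symbols are read from the expression text, a variable in the calling environment that happens to share a column's name would also cause that column to be dropped; a robust implementation would need to handle this.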

krlmlr · 2018-07-21T20:54:02Z

library(tidyverse)

#iris %>% transmutate(Petal.Area = Petal.Width * Petal.Length)
iris %>%
  as_tibble() %>% 
  mutate(Petal.Area = Petal.Width * Petal.Length) %>% 
  select(-Petal.Width, -Petal.Length)
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Species Petal.Area
#>           <dbl>       <dbl> <fct>        <dbl>
#>  1          5.1         3.5 setosa       0.280
#>  2          4.9         3   setosa       0.280
#>  3          4.7         3.2 setosa       0.26 
#>  4          4.6         3.1 setosa       0.3  
#>  5          5           3.6 setosa       0.280
#>  6          5.4         3.9 setosa       0.68 
#>  7          4.6         3.4 setosa       0.42 
#>  8          5           3.4 setosa       0.3  
#>  9          4.4         2.9 setosa       0.280
#> 10          4.9         3.1 setosa       0.15 
#> # ... with 140 more rows

Created on 2018-07-21 by the reprex package (v0.2.0).

Thanks, I'm missing this functionality myself occasionally. Instead of parsing the expression, we could detect and record column access in the C++ code.

What should the verb do in a grouped scenario, if some groups access a different set of columns than other groups?

ArtemSokolov · 2018-07-22T03:36:40Z

I think there are two natural options: union - remove all columns that are accessed by at least one group, or intersection - remove only the columns that are accessed by all groups. It might be nice to be able to specify which of the two should be used, but I'm not sure where such an option would go...
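
Editor's sketch of the options named above, using base R set operations. `resolve_used()` and the example column names are hypothetical; the `"error"` policy corresponds to the alternative raised later in the thread.

```r
# Hypothetical helper: combine the per-group sets of accessed columns
# into the single set of columns to remove.
resolve_used <- function(per_group,
                         policy = c("union", "intersection", "error")) {
  policy <- match.arg(policy)
  switch(policy,
    # Remove a column if at least one group accessed it.
    union = Reduce(union, per_group),
    # Remove a column only if every group accessed it.
    intersection = Reduce(intersect, per_group),
    # Refuse to proceed unless all groups accessed the same columns.
    error = {
      same <- all(vapply(
        per_group,
        function(x) setequal(x, per_group[[1]]),
        logical(1)
      ))
      if (!same) stop("Groups accessed different columns.", call. = FALSE)
      per_group[[1]]
    }
  )
}

per_group <- list(X = c("a", "shared"), Y = c("b", "shared"))
resolve_used(per_group, "union")         # "a" "shared" "b"
resolve_used(per_group, "intersection")  # "shared"
```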

krlmlr · 2018-08-01T14:07:11Z

I'd rather support only intersection, but that might be more difficult to implement than union.

The biggest problem I see with both options is that type stability is compromised -- the resulting data frame might end up with different columns, depending on the data. Perhaps the safest thing to do would be to raise an error if different columns are accessed for different groups from this verb.

What naming alternatives do we have? It might be difficult to remember the differences between mutate(), transmute() and transmutate().

ArtemSokolov · 2018-08-01T17:38:35Z

I agree that perhaps the appropriate action for accessing different columns across groups is to raise an error. I'm not sure I follow the type stability concerns; if the intersection of all accessed columns is removed from each group, is that not a consistent transformation of each group?

For naming alternatives, perhaps we can turn to a thesaurus: https://www.thesaurus.com/browse/transmute
If it doesn't become too annoying to type, metamorphose() might be a viable option.

EDIT: A nicer alternative might be alter(), as taken from https://www.thesaurus.com/browse/mutate

krlmlr · 2018-08-02T02:51:15Z

Suppose we have two group types: X and Y, the mutator code for group X accesses column a, for Y column b. If only groups of type X are present in the data, column a is accessed and removed; if both types are present, both columns a and b are accessed -- which to remove? Both intersect and union produce results inconsistent with the first scenario.

We create something, but also take something else away. How about trade() ?
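
Editor's sketch of the two-group scenario above, assuming dplyr is installed. `touch()` is a hypothetical recorder that stands in for access tracking inside the data mask; the columns it reports are annotated by hand, so this only illustrates that the set of accessed columns depends on which groups are present.

```r
library(dplyr)

df <- tibble(g = c("X", "X", "Y"), a = c(1, 2, 3), b = c(10, 20, 30))

# Hypothetical recorder: notes a column name, then returns the column.
accessed <- character()
touch <- function(name, value) {
  accessed <<- union(accessed, name)
  value
}

# Run the same mutator on `data` and report which columns it touched.
run <- function(data) {
  accessed <<- character()
  data %>%
    group_by(g) %>%
    mutate(z = if (g[1] == "X") touch("a", a) else touch("b", b))
  accessed
}

run(filter(df, g == "X"))  # only group X present: records "a"
run(df)                    # groups X and Y present: records "a" and "b"
```

The same mutator would therefore remove a different set of columns depending on the data, which is the type-stability problem under discussion.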

ArtemSokolov · 2018-08-02T15:04:04Z

Thanks for the example, Kirill. That makes sense, and raising an error seems like the best approach to maintain consistency.

I think I have a slight preference towards alter(), because it is semantically similar to mutate() and transmute(). But trade() is a good choice as well!

krlmlr · 2018-08-03T07:53:18Z

morph() ?

ArtemSokolov · 2018-08-03T15:56:02Z

Yes!! morph() is perfect.

krlmlr · 2018-09-08T07:53:51Z

I like the idea of the .keep = "all" argument to mutate(), or perhaps .remove = "none" (with options "other", "used" and "used_once"). But it's rather low priority now.
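
Editor's note: the `.keep` idea is what eventually shipped. In dplyr 1.0.0 and later, `mutate()` accepts `.keep = "all"`, `"used"`, `"unused"`, or `"none"`, and `"unused"` gives the behaviour proposed here for morph(). A minimal example, assuming dplyr >= 1.0.0:

```r
library(dplyr)

# .keep = "unused" drops the columns that were used to build Petal.Area.
out <- iris %>%
  as_tibble() %>%
  mutate(Petal.Area = Petal.Width * Petal.Length, .keep = "unused")

names(out)
```

The remaining columns match the mutate() + select() reprex earlier in the thread.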

lionel- · 2019-02-08T18:24:41Z

> I'd suggest transmutate or @krlmlr's suggestion trade.

I find those less descriptive of the proposed feature.

> this functionality could be achieved with an argument to transmute or mutate

We're thinking about a new workflow based on tibble return values where morph() might play an important part. In that case, an argument would get in the way of important patterns. See tidyverse/tidyr#523 (comment) for an example.

hadley · 2020-01-17T21:34:41Z

A few thoughts on implementation given the recent changes to dplyr internals: implementing morph() looks to be relatively straightforward. In the binding functions in the data mask, we'd just record whenever a variable was materialised. (The only trick is to do this in such a way that it doesn't affect ordinary performance, but that's not too hard.) @lionel-'s concern about materialising columns doesn't apply to across(), since it already avoids materialising all variables by using the full data set in eval_select().

My primary concern is that this will be harder to implement if we switch to using ALTREP slices, but we could probably still have a fallback for morph() that used the old approach.
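
Editor's sketch of the tracking idea, using base R active bindings. This is an analogue only, not dplyr's actual internals; `track_access()` is a hypothetical name. Each column is exposed through a binding function that records its own name the first time the column is read.

```r
# Hypothetical helper: build an environment in which every column of
# `data` is an active binding that records when it is accessed.
track_access <- function(data) {
  used <- character()
  mask <- new.env(parent = globalenv())
  lapply(names(data), function(nm) {
    makeActiveBinding(nm, function() {
      used <<- union(used, nm)  # record the access
      data[[nm]]                # then hand back the column
    }, mask)
  })
  list(mask = mask, used = function() used)
}

tracker <- track_access(iris)
area <- eval(quote(Petal.Width * Petal.Length), tracker$mask)
tracker$used()
```

Only the columns the expression actually reads are recorded, so a verb built on this could drop exactly those columns afterwards.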

krlmlr added the "feature" label (a feature request or enhancement) Aug 1, 2018

mkoohafkan mentioned this issue Oct 5, 2018

Allow ungroup to specify removal of grouping variable #3760

Closed

This was referenced Feb 8, 2019

pack() / unpack() tidyverse/tidyr#523

Closed

Possible extract() + rematch2 happiness tidyverse/tidyr#548

Closed

lionel- mentioned this issue Feb 13, 2019

Support functions in selections r-lib/tidyselect#86

Closed

lionel- mentioned this issue Feb 22, 2019

tibble() should auto-splice unnamed tibble columns tidyverse/tibble#581

Closed

hadley mentioned this issue Mar 8, 2019

Warning when separate() replaces an existing column? tidyverse/tidyr#547

Closed

hadley changed the title ~~Proposed feature: transmutate() automatically removes columns that were used in a mutate()-like call.~~ morph() to automatically removes columns "used up" by a mutate() Dec 10, 2019

hadley changed the title ~~morph() to automatically removes columns "used up" by a mutate()~~ morph() to automatically remove columns "used up" by a mutate() Dec 10, 2019

hadley added the verbs 🏃‍♀️ label Dec 11, 2019

hadley mentioned this issue Jan 17, 2020

Give more control over which variables mutate() should keep #4773

Merged

hadley closed this as completed in 7470ed0 Jan 22, 2020
