morph() to automatically remove columns "used up" by a mutate() #3721

Closed
ArtemSokolov opened this issue Jul 20, 2018 · 32 comments
Labels
feature (a feature request or enhancement), verbs 🏃‍♀️

Comments

@ArtemSokolov

Dear dplyr developers,

A recent Stack Overflow question raised an interesting use case: automatically removing from the result the columns that were fed to a mutate() call. To do this, the mutator would need to parse the input expressions to determine which symbols were used, and I made a first pass at designing such a function. The question author liked my answer and suggested that I contribute it to dplyr.

I am happy to work on a PR with a more robust implementation, but I wanted to check with you if such a feature would align with your design principles and the spirit of the package.

Thanks. Big fan of your work.
-Artem
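
A minimal sketch of the expression-parsing approach described in this comment, assuming dplyr and rlang are installed. The helper name `morph_sketch()` is hypothetical and not part of dplyr. It collects the symbols in each expression with `all.vars()` and drops those that name existing columns. It ignores grouped data and any column reached indirectly, so it is an illustration rather than a robust implementation.

```r
library(dplyr)
library(rlang)

# Hypothetical helper (not a dplyr function): mutate, then drop the
# columns whose names appear as symbols in the supplied expressions.
morph_sketch <- function(.data, ...) {
  exprs <- enquos(...)

  # Symbols referenced by each expression, restricted to existing columns
  used <- unique(unlist(lapply(exprs, function(q) all.vars(quo_get_expr(q)))))
  used <- intersect(used, names(.data))

  out <- mutate(.data, !!!exprs)

  # Keep a used column if one of the new expressions overwrote it
  drop <- setdiff(used, names(exprs))
  out[, setdiff(names(out), drop), drop = FALSE]
}

res <- morph_sketch(iris, Petal.Area = Petal.Width * Petal.Length)
names(res)
```

In this call the two petal measurements are dropped and `Petal.Area` is appended after the remaining original columns.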

@krlmlr
Member

krlmlr commented Jul 21, 2018

library(tidyverse)

#iris %>% transmutate(Petal.Area = Petal.Width * Petal.Length)
iris %>%
  as_tibble() %>% 
  mutate(Petal.Area = Petal.Width * Petal.Length) %>% 
  select(-Petal.Width, -Petal.Length)
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Species Petal.Area
#>           <dbl>       <dbl> <fct>        <dbl>
#>  1          5.1         3.5 setosa       0.280
#>  2          4.9         3   setosa       0.280
#>  3          4.7         3.2 setosa       0.26 
#>  4          4.6         3.1 setosa       0.3  
#>  5          5           3.6 setosa       0.280
#>  6          5.4         3.9 setosa       0.68 
#>  7          4.6         3.4 setosa       0.42 
#>  8          5           3.4 setosa       0.3  
#>  9          4.4         2.9 setosa       0.280
#> 10          4.9         3.1 setosa       0.15 
#> # ... with 140 more rows

Created on 2018-07-21 by the reprex package (v0.2.0).

Thanks, I'm missing this functionality myself occasionally. Instead of parsing the expression, we could detect and record column access in the C++ code.

What should the verb do in a grouped scenario, if some groups access a different set of columns than other groups?

@ArtemSokolov
Author

I think there are two natural options: union (remove all columns that are accessed by at least one group) or intersection (remove only the columns that are accessed by all groups). It might be nice to be able to specify which of the two should be used, but I'm not sure where such an option would go...

@krlmlr added the "feature" label (a feature request or enhancement) on Aug 1, 2018
@krlmlr
Member

krlmlr commented Aug 1, 2018

I'd rather support only intersection, but that might be more difficult to implement than union.

The biggest problem I see with both options is that type stability is compromised -- the resulting data frame might end up with different columns, depending on the data. Perhaps the safest thing to do would be to raise an error if different columns are accessed for different groups from this verb.

What naming alternatives do we have? It might be difficult to remember the differences between mutate(), transmute() and transmutate().

@ArtemSokolov
Author

ArtemSokolov commented Aug 1, 2018

I agree that perhaps the appropriate action for accessing different columns across groups is to raise an error. I'm not sure I follow the type stability concerns; if the intersection of all accessed columns is removed from each group, is that not a consistent transformation of each group?

For naming alternatives, perhaps we can turn to a thesaurus: https://www.thesaurus.com/browse/transmute
If it doesn't become too annoying to type, metamorphose() might be a viable option.

EDIT: A nicer alternative might be alter(), as taken from https://www.thesaurus.com/browse/mutate

@krlmlr
Member

krlmlr commented Aug 2, 2018

Suppose we have two group types, X and Y: the mutator code for group X accesses column a, and for group Y it accesses column b. If only groups of type X are present in the data, column a is accessed and removed; if both types are present, both columns a and b are accessed -- which to remove? Both intersection and union produce results inconsistent with the first scenario.

We create something, but also take something else away. How about trade() ?
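
The X/Y scenario in this comment can be made concrete with a small base R sketch. The helper `resolve_used()` and its `policy` argument are hypothetical names used only for illustration; the input is assumed to be a list holding, for each group, the names of the columns that group accessed.

```r
# Hypothetical helper: decide which columns to drop, given the set of
# columns accessed by each group.
resolve_used <- function(used_by_group,
                         policy = c("error", "union", "intersection")) {
  policy <- match.arg(policy)
  switch(policy,
    union = Reduce(union, used_by_group),
    intersection = Reduce(intersect, used_by_group),
    error = {
      first <- used_by_group[[1]]
      same <- vapply(used_by_group, function(u) setequal(u, first), logical(1))
      if (!all(same)) {
        stop("groups accessed different columns", call. = FALSE)
      }
      first
    }
  )
}

only_x <- list(X = "a")           # data containing only groups of type X
both   <- list(X = "a", Y = "b")  # data containing both group types

drop_x     <- resolve_used(only_x, "union")
drop_union <- resolve_used(both, "union")
drop_inter <- resolve_used(both, "intersection")
```

With only X present, column a is dropped. With both types present, the union drops a and b while the intersection drops nothing, so neither matches the X-only result; the `"error"` policy refuses to proceed instead.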

@ArtemSokolov
Author

Thanks for the example, Kirill. That makes sense, and raising an error seems like the best approach to maintain consistency.

I think I have a slight preference towards alter(), because it is semantically similar to mutate() and transmute(). But trade() is a good choice as well!

@krlmlr
Member

krlmlr commented Aug 3, 2018

morph() ?

@ArtemSokolov
Author

Yes!! morph() is perfect.

@krlmlr
Member

krlmlr commented Sep 8, 2018

I like the idea of the .keep = "all" argument to mutate(), or perhaps .remove = "none" (with options "other", "used" and "used_once"). But it's rather low priority now.
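
For reference, an argument of this kind exists in released dplyr: to the best of my knowledge, versions 1.0.0 and later accept `.keep` in mutate() with the values "all", "used", "unused" and "none". The sketch below assumes such a version is installed and uses `.keep = "unused"` to reproduce the mutate() plus select() example from earlier in the thread.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, where mutate() has a `.keep` argument.
# "unused" retains only the columns not referenced in the expressions,
# plus the newly created ones.
res <- iris %>%
  as_tibble() %>%
  mutate(Petal.Area = Petal.Width * Petal.Length, .keep = "unused")

names(res)
```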

@lionel-
Member

lionel- commented Feb 8, 2019

> I'd suggest transmutate or @krlmlr's suggestion trade.

I find those less descriptive of the proposed feature.

> this functionality could be achieved with an argument to transmute or mutate

We're thinking about a new workflow based on tibble return values where morph() might play an important part. In that case, an argument would get in the way of important patterns. See tidyverse/tidyr#523 (comment) for an example.

@hadley changed the title on Dec 10, 2019
from: Proposed feature: transmutate() automatically removes columns that were used in a mutate()-like call.
to: morph() to automatically removes columns "used up" by a mutate()
@hadley changed the title on Dec 10, 2019
from: morph() to automatically removes columns "used up" by a mutate()
to: morph() to automatically remove columns "used up" by a mutate()
@hadley
Member

hadley commented Jan 17, 2020

A few thoughts on implementation given the recent changes to dplyr internals: implementing morph() looks to be relatively straightforward. In the binding functions in the data mask, we'd just record whenever a variable was materialised. (The only trick is to do this in such a way that it doesn't affect ordinary performance, but that's not too hard.) @lionel-'s concern about materialising columns doesn't apply to across(), since it already avoids materialising all variables by using the full data set in eval_select().

My primary concern is that this will be harder to implement if we switch to using ALTREP slices, but we could probably still have a fallback for morph() that used the old approach.
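
The idea of recording materialisation in the binding functions can be illustrated with base R active bindings. This is only a sketch under stated assumptions: `eval_tracking()` is a hypothetical helper, it is not how dplyr's data mask is implemented, and it handles neither groups nor the performance concern raised above.

```r
# Hypothetical helper: evaluate `expr` against `data` and report which
# columns were read, using one active binding per column.
eval_tracking <- function(data, expr) {
  used <- character()
  mask <- new.env(parent = parent.frame())

  for (nm in names(data)) {
    local({
      col <- nm  # capture the current column name for this binding
      makeActiveBinding(col, function() {
        used <<- union(used, col)  # record the access
        data[[col]]                # then hand back the column
      }, mask)
    })
  }

  value <- eval(expr, mask)
  list(value = value, used = used)
}

res <- eval_tracking(iris, quote(Petal.Width * Petal.Length))
res$used
```

Unlike parsing the expression, this approach only records columns that are actually evaluated, which is why different groups could in principle report different sets.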

9 participants