group mutate revamp #109

ppaxisa · 2024-04-06T13:33:39Z

When operating on an ungrouped object, mutate will use methods from {plyranges} but when operating on a grouped object, mutate will rely on vanilla tidyverse by converting the object to a tibble. The result is then coerced back to DataFrame to update the metadata of the object. Of note, core columns can be used in this framework to create new metadata columns.
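A minimal sketch of the grouped path described above, for readers who want to see the round trip. It assumes the conversion goes through `as.data.frame()` and `tibble::as_tibble()`; `grouped_mutate_sketch` and `core_names` are hypothetical names for illustration and are not the functions in this PR.

```r
library(GenomicRanges)
library(dplyr)

# Core (non-metadata) columns produced by as.data.frame() on a GRanges
core_names <- c("seqnames", "start", "end", "width", "strand")

# Hypothetical helper: grouped mutate via a tibble round trip
grouped_mutate_sketch <- function(gr, group_var, ...) {
  # Core and metadata columns end up side by side in one tibble
  df <- tibble::as_tibble(as.data.frame(gr))
  out <- df %>%
    group_by(across(all_of(group_var))) %>%
    mutate(...) %>%
    ungroup()
  # Only metadata columns are written back; the ranges themselves are untouched
  meta <- as.data.frame(out[, setdiff(names(out), core_names), drop = FALSE])
  mcols(gr) <- S4Vectors::DataFrame(meta, check.names = FALSE)
  gr
}

gr <- GRanges("chr1", IRanges(start = c(1, 11, 21, 31), width = 5),
              grp = c("a", "a", "b", "b"), score = c(1, 3, 10, 20))

# 'centered' uses a per-group mean; 'mid' is built from core columns
res <- grouped_mutate_sketch(gr, "grp",
                             centered = score - mean(score),
                             mid = start + width / 2)
mcols(res)
```

Because the tibble holds the core columns too, expressions such as `start + width / 2` can feed new metadata columns, which is the behaviour the description mentions.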


mikelove · 2024-04-07T12:31:35Z

thanks @ppaxisa!

I'm just doing a little package cleanup today. I plan to merge in your PR on Wednesday (I'm out of office Mon and Tue).

sa-lee · 2024-04-07T23:13:43Z

Is this a breaking change when a user tries to mutate with a column that is an S4 class (one of the original reasons I didn't use base tidyverse for grouped mutate in plyranges, and why it was slow)? I would prefer an implementation that does allow for S4 columns but understand that maybe this is a rare use case and we are willing to make tradeoffs here. @lawremi do you have thoughts on this?

ppaxisa · 2024-04-08T07:51:15Z

I have not formally tested that but yes, I believe this would be a breaking change when attempting to use S4 columns on a grouped object (S4 columns would still work on an ungrouped object).

mikelove · 2024-04-08T10:58:44Z

I can do testing on Wednesday (and add unit tests for this case). Maybe there is a solution so we deliver speed when we can but not break on S4. ~~Ranges derived from TxDb and EnsDb are common and have S4 columns~~

mikelove · 2024-04-11T22:56:25Z

@sa-lee Pierre-Paul and I are considering:

Upon group_by + mutate:

- If the metadata columns contain any S4 variable, keep the original behavior but show a message that a faster operation is possible without S4 variables (either through select or downgrading)
- Provide a convenience function for downgrading S4 columns
- Provide the faster dplyr-based mutate only when S4 columns are not present
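The proposal above could be sketched roughly as follows. This only illustrates the branching; `has_s4_cols` and `choose_grouped_mutate` are hypothetical names, and the returned strings stand in for whatever the two real code paths would be.

```r
library(S4Vectors)

# Hypothetical check: TRUE if any column of the metadata DataFrame is an S4 object
has_s4_cols <- function(meta) {
  any(vapply(as.list(meta), isS4, logical(1)))
}

# Hypothetical dispatcher: returns a label for the path that would be taken
choose_grouped_mutate <- function(meta) {
  if (has_s4_cols(meta)) {
    message("S4 metadata columns detected: using the original grouped mutate. ",
            "Dropping or downgrading them allows the faster dplyr-based path.")
    "original"
  } else {
    "dplyr"
  }
}

plain <- DataFrame(x = 1:3, y = c("a", "b", "c"))
with_s4 <- DataFrame(x = 1:3, r = Rle(c(1L, 1L, 2L)))

choose_grouped_mutate(plain)
choose_grouped_mutate(with_s4)
```

A plain if / else keeps the two paths mutually exclusive, so S4 columns are never passed to the tibble conversion.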

Thoughts?

sa-lee · 2024-04-11T23:11:23Z

That sounds like a good approach to me. There are vectorised methods for a lot of S4 List classes that group_by() + summarize() takes advantage of, so maybe also worth looking into that code as well to see if anything can be adapted over.

mikelove · 2024-04-16T19:55:30Z

Had some things pop up so didn’t get a chance to implement anything but plan on it later this week or next.

Thinking this will go to next Bioc cycle which is fine — important to get this working well, given the broad impact.

ppaxisa · 2024-04-18T11:22:49Z

If it helps: can you give me some pointers to try and write the convenience function that checks if there are S4 columns in the DataFrame? Not sure where to start but I can spend some time to move forward on this.

mikelove · 2024-04-30T18:23:09Z

To put a concrete case to the current discussion:

https://gist.github.com/mikelove/d788831af3cf76de642ba03af7a0124b?permalink_comment_id=5042008#gistcomment-5042008

mikelove · 2024-04-30T18:23:46Z

I have time again to work on this (really!). I'll report back by end of week.

mikelove · 2024-05-03T19:13:56Z

So I think just a check here that there are no S4 columns is all we need. I can build out the rest.

I was trying to follow the logic of this function and got a little lost. Should it just be an if / else block? Because any(core) and any(!core) aren't complements.

https://github.com/ImmuneAxisa/plyranges/blob/mutate_grp_revamp/R/dplyr-mutate.R#L88-L125

Another ask: because the code went through many iterations, could you make a new branch with your final changes, and can you mark the regions with comments, e.g. # PPA grouped mutate speedup, 2024

mikelove · 2024-05-03T19:16:25Z

> d <- DataFrame(x=1:10, y=11:20, z=IntegerList(as.list(1:10)))
> lapply(d, isS4)
$x
[1] FALSE

$y
[1] FALSE

$z
[1] TRUE

> any(sapply(d, isS4))
[1] TRUE
>
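Building on the `isS4` check above, a convenience function for downgrading S4 columns might look like the sketch below. `downgrade_s4_cols` is a hypothetical name, and the conversion rule (`as.list()` for `List` subclasses, `as.vector()` otherwise) is an assumption that covers common cases such as `IntegerList` and `Rle`; other S4 column types may need their own handling.

```r
library(S4Vectors)
library(IRanges)

# Hypothetical helper: replace S4 columns with base R equivalents
downgrade_s4_cols <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (!isS4(col)) next
    # List subclasses (e.g. IntegerList) become base lists;
    # anything else (e.g. Rle) is coerced with as.vector(), which may
    # fail for S4 classes without such a method
    df[[nm]] <- if (is(col, "List")) as.list(col) else as.vector(col)
  }
  df
}

d <- DataFrame(x = 1:10, y = 11:20, z = IntegerList(as.list(1:10)))
d2 <- downgrade_s4_cols(d)

# Re-run the check on the downgraded DataFrame
any(vapply(as.list(d2), isS4, logical(1)))
```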

ppaxisa · 2024-05-07T13:13:39Z

Ok I'll review the decision tree for which functions to use, comment that, and create a new branch. This might take a little while, I have limited bandwidth the next 3 weeks. I guess I'll have to close this pull request and submit a new one?

mikelove · 2024-05-07T18:54:27Z

That would be awesome. No time pressure and thanks again for your effort on this.

You can just leave this PR open, and make a new one once you get to it?

sa-lee · 2024-05-08T23:27:42Z

Just chiming in to say I'm happy to review any PRs moving forward.

ppaxisa added 10 commits September 10, 2023 10:07

- cd72a0b: modified mutate function to rely on dplyr for metadata. core columns mutate rely on old mutate_core and mutate_grp functions
- 2be51d5: clean up mutate grouping after df conversion
- af4508a: repair id and name to column functions given new mutate function
- 05cbd1b: ungroup range object before converting to df, avoids error
- 59957fd: adjusting test to avoid using S4Vectors class with mutate. mutate seemed unnecessary to pass the test in the first place.
- 8158560: modify expand method for plyranges to accomodate list columns: no use of S4Vectors classes with mutate so everything is coerced to base::list type.
- 6e0ff1b: remove S4Vectors types from example with mutate
- 43ffccc: fix some check errors
- b3219ec: vanilla dplyr mutate only on grouped ranges
- 103d762: roll back expand methods and test to use S4vectors List methods
