Feature Request: Preferred Column Values After Merge #3267
Thanks. This looks like a useful feature. I would use it for applying patch tables: partial dictionaries that contain replacement values for only a subset of the data. Maybe we can create new verbs for this operation, since it only seems useful in full joins?
update_join <- function(x, y, by) ...
combine_join <- function(x, y, by) ...
(Not sure what a good naming choice would be here.) The operations can be implemented much faster when values are combined right away during the join. Also, SQL backends could do this very efficiently. We'd have to decide what to do with incompatible data types; I'd say we should be strict and throw an error. What do you think?
Perhaps stay near
I'm glad that you like the idea, @krlmlr! I see it as useful in many scenarios, not just in the context of a join. For example, I often work with meta-analysis databases (data extracted from many journal articles in the scientific literature), where I have similar sets of information from different articles and want to choose the best among them. The number of individuals with a measurement may come from a column specific to "number of individuals with a measurement" or from "number of individuals in the study". (That sounds like your partial dictionary example, except that it originates in a single data set rather than from a join.) Within the context of joins, I see value in My reason for giving the example is: I like the function existing separately so that I can use it when I am not joining. But I do often need it with a join, and I can see that it would be much faster performed during the join than as post-processing afterwards. For naming... shrug I don't have a strong preference. Of the choices you list, I like "update" rather than "combine" because it sounds more like the operation that is occurring. (Edit just before sending: I like @JohnMount's suggestion of For incompatible data types there are two questions:
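The single-data-set case described above (a specific count column with a more general column as fallback) can be sketched with dplyr's existing `coalesce()`. The table and its column names (`n_measured`, `n_study`) are made up for illustration; they are not from the thread.

```r
library(dplyr)

# Hypothetical extraction table: one row per article.
studies <- tibble(
  article    = c("A", "B", "C"),
  n_measured = c(12L, NA, NA),  # preferred, more specific source
  n_study    = c(15L, 20L, NA)  # fallback, less specific source
)

# coalesce() returns, element by element, the first non-missing value
# among its arguments, so argument order encodes the preference.
best <- studies %>%
  mutate(n = coalesce(n_measured, n_study))

best
```

Rows where both columns are missing stay `NA`; nothing is imputed.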
The proposed operations are some join combined with a vectorized reduction operator, by default I'd like to keep an initial implementation as simple as possible: no "new" columns should be allowed on the RHS, and all observations from the LHS are kept. We may want to be able to specify which side takes precedence; this seems to give two separately useful operations (LHS wins: partial dictionary update; RHS wins: upsert). Would that solve your most common use cases? A standalone verb that operates on an already expanded table might be better suited to tidyr. Can you outline a use case? Some of this has been covered in more detail in tidyverse/tidyr#183, and earlier in #2075. I now think performance considerations make it worthwhile to consider an implementation in dplyr; a tidyr interface might add more bells and whistles.
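A rough sketch of the constraints listed above, written as a plain wrapper around `left_join()` and `coalesce()`. This is not a dplyr function and not the implementation discussed here: the name `update_join` is borrowed from the earlier suggestion, the `prefer` argument is hypothetical, and the sketch assumes `by` is a character vector of key names and that keys in `y` are unique.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr).
# - keeps every row of x (left join)
# - refuses "new" columns on the RHS
# - prefer = "x": x wins, y only fills NAs; prefer = "y": y wins where it has a value
update_join <- function(x, y, by, prefer = c("x", "y")) {
  prefer <- match.arg(prefer)
  value_cols <- setdiff(names(y), by)
  stopifnot(all(value_cols %in% names(x)))  # no new columns allowed from y

  # Columns of x keep their names; clashing columns of y get a ".y" suffix.
  joined <- left_join(x, y, by = by, suffix = c("", ".y"))

  for (col in value_cols) {
    lhs <- joined[[col]]
    rhs <- joined[[paste0(col, ".y")]]
    joined[[col]] <- if (prefer == "x") coalesce(lhs, rhs) else coalesce(rhs, lhs)
    joined[[paste0(col, ".y")]] <- NULL  # drop the helper column
  }
  joined
}

x <- tibble(id = 1:3, value = c("a", NA, "c"))
y <- tibble(id = c(2L, 3L), value = c("B", "C"))

lhs_wins <- update_join(x, y, by = "id")                # fills only the NA
rhs_wins <- update_join(x, y, by = "id", prefer = "y")  # overrides where y has a value
```

Because the combining step is `coalesce()`, columns whose types cannot be combined (for example character and numeric) should raise an error instead of being coerced silently, which is in line with the strict behaviour suggested earlier in the thread.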
@krlmlr, I agree that a simpler version makes sense within dplyr and a more bells-and-whistles interface makes sense for tidyr. I'll add to the tidyr discussion. Unfortunately, I'm not the best person to help with a SQL-friendly implementation.
Sorry, I think this is out of scope for dplyr, and would be best in tidyr.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
I often want to join two data.frames and then select the "best" result from the output columns.
This happens when I may have two sources of information with partially overlapping information. One source may be more reliable than the other, so I would prefer to use source 1 if it has a value. If source 1 doesn't have a value, I'd prefer to use the other source.
The function below does what I'm looking for, and I think it would fit in well in dplyr. If of interest, I can generate a pull request with this and tests.
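The function itself is not preserved in this copy of the issue. Purely as an illustration of the idea, and not the author's code, the pattern described (full join, then prefer source 1 and fall back to source 2) could look like this with `full_join()` and `coalesce()`. The tables and the `dose` column are invented for the example.

```r
library(dplyr)

# Two hypothetical sources with partially overlapping information.
source1 <- tibble(id = 1:3, dose = c(10, NA, 30))  # more reliable source
source2 <- tibble(id = 2:4, dose = c(21, 31, 41))  # fallback source

merged <- full_join(source1, source2, by = "id") %>%
  # After the join the clashing columns are dose.x (source 1) and dose.y (source 2).
  mutate(dose = coalesce(dose.x, dose.y)) %>%
  select(-dose.x, -dose.y)

merged
```

Here `id` 3 keeps the value from source 1 even though source 2 also has one, while `id` 2 and `id` 4 fall back to source 2.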