allow duplicate rows in x to be updated #5588
Conversation
Thanks, I like this enhancement.

- I think we shouldn't fail if rows are missing in `x`.
- Do we need to update documentation to match the new behavior?
- For now we only have output tests; can we cover this case in the tests too?
R/rows.R (outdated)

```diff
@@ -242,12 +246,12 @@ rows_check_key <- function(by, x, y) {
   by
 }

-rows_check_key_df <- function(df, by, df_name) {
+rows_check_key_df <- function(df, by, df_name, .check_unique = TRUE) {
```
Is it worth splitting this function into two functions, `rows_check_key_names_df()` and `rows_check_key_unique_df()`, to avoid the new flag?
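A minimal sketch of what that split could look like. Only the two function names come from the suggestion above; the bodies are illustrative assumptions, not dplyr's actual implementation.

```r
# Hypothetical split of rows_check_key_df() into two single-purpose helpers.
rows_check_key_names_df <- function(df, by, df_name) {
  missing <- setdiff(by, names(df))
  if (length(missing) > 0) {
    stop("Column(s) ", toString(missing), " not found in `", df_name, "`.",
         call. = FALSE)
  }
  invisible(by)
}

rows_check_key_unique_df <- function(df, by, df_name) {
  # anyDuplicated() works on a data frame, comparing whole key rows
  if (anyDuplicated(df[by]) > 0) {
    stop("Key values in `", df_name, "` must be unique.", call. = FALSE)
  }
  invisible(by)
}
```

Callers that tolerate duplicate keys (like the relaxed `rows_update()` in this PR) would then call only the first helper.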
R/rows.R (outdated)

```r
bad <- which(is.na(idx))
if (has_length(bad)) {
if (!all(vec_in(y[key], x[key]))){
```
Do we even need this check?
```diff
-if (!all(vec_in(y[key], x[key]))){
+if (FALSE) {
```
If we introduce an asymmetry here between zero and nonzero matches, we contradict the recycling rules in vctrs.

The current implementation requires one and only one match. No recycling. It's safe and limiting.

The proposed implementation recycles one match to the number of entries in the target table, but fails if the target table has zero entries.

- Avoiding the check entirely makes us more consistent with the recycling rules.
- One use case of `rows_update()` might be imputation: we update `NA` or otherwise bad values in existing columns in `x` based on a lookup table in `y`. Why should that fail when some of the looked-up values are missing in `x`?
- The SQL version `UPDATE x SET ... FROM x, y WHERE x.key == y.key` also silently ignores missing entries in `x`.
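The semantics being argued for can be sketched in base R with `match()`: every row of `x` whose key matches a row of `y` is updated (duplicates included), and keys in `y` with no match in `x` are silently ignored. This is an illustrative stand-in, assuming a single key column, not the PR's actual code.

```r
# Sketch of the proposed rows_update() behavior (base R, single key column).
rows_update_sketch <- function(x, y, key) {
  idx <- match(x[[key]], y[[key]])   # one index per row of x; NA = no match
  pos <- which(!is.na(idx))          # rows of x that have a match in y
  cols <- setdiff(names(y), key)     # only non-key columns are updated
  x[pos, cols] <- y[idx[pos], cols]
  x
}

x <- data.frame(a = c(1, 2, 1), b = 1)
y <- data.frame(a = c(1, 3), b = c(2, 9))   # key a = 3 has no match in x
rows_update_sketch(x, y, "a")   # both a == 1 rows get b = 2; a = 3 is ignored
```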
Co-authored-by: Kirill Müller <krlmlr@users.noreply.github.com>
…ck_key_unique_df()
Thanks @krlmlr for the initial review of the draft. Going a little further now.
Slight tangent, but while we are touching these functions, I think it would be clearer if we pulled this check out into its own function (Lines 237 to 240 in ebb6448). It doesn't really have to do with the key.
```r
if (has_length(bad)) {
  rows_check_key_unique_df(y, key, df_name = "y")

if (any(vctrs::vec_in(y[key], x[key]))) {
```
I think being able to update multiple rows in `x` with the same key implies that we are okay with duplicate keys. It might make sense to remove this check, allowing you to also insert a row with a duplicate key.
Not sure. For insertion we might really care about uniqueness. On the other hand, should `rows_insert()` or the underlying storage be responsible for identifying duplicate key violations?
I was also wondering if a `strict = TRUE/FALSE` argument would make sense for the data frame method. If `strict = TRUE`, then we can't add a duplicate key to `x`. This argument might make sense for other `rows_*()` functions too.
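One hypothetical shape for such a flag. The `strict` name comes from the comment above; the function body, single-key assumption, and error message are illustrative guesses, not an agreed API.

```r
# Sketch of rows_insert() with an optional uniqueness guard (base R).
rows_insert_sketch <- function(x, y, key, strict = TRUE) {
  clash <- y[[key]] %in% x[[key]]
  if (strict && any(clash)) {
    stop("`y` contains key(s) already present in `x`.", call. = FALSE)
  }
  rbind(x, y)   # with strict = FALSE, duplicate keys are simply appended
}

x <- data.frame(a = c(1, 2), b = 1)
y <- data.frame(a = 1, b = 9)
nrow(rows_insert_sketch(x, y, "a", strict = FALSE))   # 3: duplicate key kept
```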
```r
pos <- which(!is.na(idx))
idx <- idx[pos]

x[pos, names(y)] <- y[idx, ]
```
Something like these might be useful tests to add:

```r
x <- data.frame(a = c(1, 2), b = 1)
y <- data.frame(a = 3, b = 2)

# `y` key that isn't in `x` = no changes
expect_identical(rows_update(x, y, "a"), x)

x <- data.frame(a = c(1, 2, 1), b = 1)
y <- data.frame(a = 1, b = 2)
expect <- data.frame(a = c(1, 2, 1), b = c(2, 1, 2))

# can update duplicate keys in `x`
expect_identical(rows_update(x, y, "a"), expect)
```
```r
x[idx, names(y)] <- new_data
x[pos, names(y)] <- map2(x[pos, names(y)], y[idx, ], coalesce)
```
I imagine this should only `coalesce` over columns in `x` and `y` that are not key columns, but I am not sure this would have any practical difference.
We would avoid extra work.
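A sketch of that restriction, coalescing non-key columns only. The `coalesce2` helper and the `Map()` call are base-R stand-ins for dplyr's `coalesce()` and purrr's `map2()`; the column handling is an assumption, not the PR's code.

```r
# Minimal coalesce stand-in: take `a` where present, else `b`.
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

rows_patch_sketch <- function(x, y, key) {
  idx <- match(x[[key]], y[[key]])
  pos <- which(!is.na(idx))
  cols <- setdiff(names(y), key)   # key columns already match; skip them
  x[pos, cols] <- Map(coalesce2,
                      x[pos, cols, drop = FALSE],
                      y[idx[pos], cols, drop = FALSE])
  x
}

x <- data.frame(a = c(1, 2), b = c(NA, 5))
y <- data.frame(a = 1, b = 9)
rows_patch_sketch(x, y, "a")$b   # only the NA is filled: 9 5
```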
```r
rows_check_key_df(x, key, df_name = "x")
rows_check_key_df(y, key, df_name = "y")
rows_check_key_names_df(x, key, df_name = "x")
rows_check_key_unique_df(x, key, df_name = "x")
```
Seems useful to be able to delete multiple rows in `x` with the same key, i.e.

```r
x <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
y <- data.frame(a = 1, b = 1)
# rows_delete(x, y, by = c("a", "b"))
x[3, ]
#> a b
#> 3 1 2
```

That just continues the theme of this PR, which allows `x` to have duplicate keys.
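The behavior asked for above could be sketched like this. It assumes a single key column for brevity (the compound-key example would need the key columns combined first); the function name and body are illustrative, not dplyr's implementation.

```r
# Delete every row of x whose key appears in y, duplicates included.
rows_delete_sketch <- function(x, y, key) {
  x[!(x[[key]] %in% y[[key]]), , drop = FALSE]
}

x <- data.frame(a = c(1, 1, 2), b = c(1, 2, 3))
y <- data.frame(a = 1)
rows_delete_sketch(x, y, "a")   # both rows with a == 1 are removed
```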
I wonder if we could make enforcement of key constraints optional?
I think it makes sense to enforce that
I think it makes sense that keys in
If implemented, I think this optional key constraint would mainly affect whether or not duplicate

Input - I don't think it is
Output - Only
Wanted to +1 this issue as I just hit it as well. It'd be nice to see the fix make it into a dplyr release. In the meantime,
👍 I would even say that duplicate keys in
👍 This makes a lot of sense, especially when working with databases. And this also exists in databases (e.g. `ON CONFLICT DO NOTHING`).
I think at least for updating and patching it would be nice to have an option for this. Or there should at least be a function to check whether all keys in
Note that -> Should
A summary of the above and some additions from my side: duplicates in
One thing we have learned when updating the join API is that it is important to restrict new "checking" arguments to cases that depend on the algorithm itself and can't otherwise be checked outside that function. Otherwise we end up with an explosion of arguments with no clear bounds on where to stop (i.e. you can always "check" for more things). For example, a user could check that the keys of

The same ideas apply here, so I'll build off of @mgirlich's nice summary to outline my proposed plan of action:
This plan means that checking for uniqueness in the keys of
Steps:
TL;DR: Any extra checking arguments will cause moderate to severe pain downstream. Let's add checker methods instead.

For database (and later perhaps data.table) we have an

As of now, I don't see

Given these constraints, I propose to add a set of checking functions that check the sanity of a
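A standalone checker along the lines proposed above might look like this. The name, signature, and error message are assumptions for illustration, not an agreed API; the point is only that the check lives outside the `rows_*()` verbs.

```r
# Hypothetical checker, usable before any rows_*() verb:
# does every key in `y` have at least one match in `x`?
rows_check_contained <- function(x, y, key) {
  unmatched <- !(y[[key]] %in% x[[key]])
  if (any(unmatched)) {
    stop(sum(unmatched), " key(s) in `y` have no match in `x`.", call. = FALSE)
  }
  invisible(TRUE)
}

x <- data.frame(a = c(1, 2), b = 1)
rows_check_contained(x, data.frame(a = 1, b = 9), "a")   # passes silently
```

Users who want strictness call the checker first; users who want the lenient, recycling-consistent behavior just skip it.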
I need to review the other functions, and understand better what they all do, but I think this is legit for #5553
Created on 2020-11-04 by the reprex package (v0.3.0.9001)