allow duplicate rows in x to be updated #5588

romainfrancois · 2020-11-04T17:20:40Z

I need to review the other functions, and understand better what they all do, but I think this is legit for #5553

library(dplyr)

df1 <- tibble(x = c(1, 1, 2), y = c(2, 3, 5))
df2 <- tibble(x = 1, y = 4)
rows_update(df1, df2, by = "x")
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     4
#> 2     1     4
#> 3     2     5

Created on 2020-11-04 by the reprex package (v0.3.0.9001)

krlmlr

Thanks, I like this enhancement.

- I think we shouldn't fail if rows are missing in x.
- Do we need to update documentation to match the new behavior?
- For now we only have output tests, can we cover this case in the tests too?

krlmlr · 2020-11-05T04:02:49Z

R/rows.R

@@ -242,12 +246,12 @@ rows_check_key <- function(by, x, y) {
  by
 }

-rows_check_key_df <- function(df, by, df_name) {
+rows_check_key_df <- function(df, by, df_name, .check_unique = TRUE) {


Is it worth splitting this function into two functions, rows_check_key_names_df() and rows_check_key_unique_df(), to avoid the new flag?

krlmlr · 2020-11-05T04:12:05Z

R/rows.R


-  bad <- which(is.na(idx))
-  if (has_length(bad)) {
+  if (!all(vec_in(y[key], x[key]))){


Do we even need this check?

Suggested change

-  if (!all(vec_in(y[key], x[key]))){
+  if (FALSE) {

If we introduce an asymmetry here between zero and nonzero matches, we contradict recycling rules in vctrs.

The current implementation requires one and only one match. No recycling. It's safe but limiting.

The proposed implementation recycles one match to the number of entries in the target table, but fails if the target table has zero entries.

Avoiding the check entirely makes us more consistent with recycling rules

One use case of rows_update() might be imputation: we update NA or otherwise bad values in existing columns in x based on a lookup table in y. Why should that fail when some of the looked-up values are missing in x?

The SQL version UPDATE x SET ... FROM x, y WHERE x.key = y.key also silently ignores missing entries in x.

romainfrancois · 2020-11-05T14:12:47Z

Thanks @krlmlr for the initial review of the draft. Going a little further now.

DavisVaughan · 2020-11-05T15:10:53Z

Slight tangent - but while we are touching these functions, I think it would be clearer if we pulled this check out into its own function: rows_check_subset(x, y)

dplyr/R/rows.R

Lines 237 to 240 in ebb6448

    
           bad <- setdiff(colnames(y), colnames(x)) 
        
           if (has_length(bad)) { 
        
             abort("All columns in `y` must exist in `x`.") 
        
           }

It doesn't really have to do with the key

DavisVaughan · 2020-11-05T15:20:10Z

R/rows.R

-  if (has_length(bad)) {
+  rows_check_key_unique_df(y, key, df_name = "y")
+
+  if (any(vctrs::vec_in(y[key], x[key]))) {


I think being able to update multiple rows in x with the same key implies that we are okay with duplicate keys.

It might make sense to remove this check, allowing you to also insert a row with a duplicate key.

Not sure. For insertion we might really care for uniqueness. On the other hand, should rows_insert() or the underlying storage be responsible for identifying duplicate key violations?

I was also wondering if a strict = TRUE/FALSE would make sense for the data frame method. If strict = TRUE, then we can't add a duplicate key to x. This argument might make sense for other rows_*() functions too.

DavisVaughan · 2020-11-05T16:06:45Z

R/rows.R

+  pos <- which(!is.na(idx))
+  idx <- idx[pos]
+
+  x[pos, names(y)] <- y[idx, ]


Something like these might be useful tests to add

x <- data.frame(a = c(1, 2), b = 1)
y <- data.frame(a = 3, b = 2)

# `y` key that isn't in `x` = no changes
expect_identical(rows_update(x, y, "a"), x)

x <- data.frame(a = c(1, 2, 1), b = 1)
y <- data.frame(a = 1, b = 2)
expect <- data.frame(a = c(1, 2, 1), b = c(2, 1, 2))

# can update duplicate keys in `x`
expect_identical(rows_update(x, y, "a"), expect)
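
For readers following along, the matching step in the diff above can be approximated in base R. This is only a sketch under stated assumptions: `update_rows_sketch()` is a hypothetical helper, it handles a single key column, and it uses `match()` where the PR appears to use vctrs matching. It is not dplyr's implementation.

```r
# Hypothetical helper, not dplyr's code: update rows of `x` from `y` on one
# key column. Duplicate keys in `x` are allowed; keys in `y` that are absent
# from `x` are silently ignored; keys in `y` must be unique.
update_rows_sketch <- function(x, y, key) {
  stopifnot(!anyDuplicated(y[[key]]))

  idx <- match(x[[key]], y[[key]])  # for each row of x: matching row of y, or NA
  pos <- which(!is.na(idx))         # rows of x that have a match in y

  # Overwrite the matched rows of x with the corresponding rows of y.
  x[pos, names(y)] <- y[idx[pos], , drop = FALSE]
  x
}

x <- data.frame(a = c(1, 2, 1), b = 1)
y <- data.frame(a = c(1, 3), b = 2)  # key 3 does not occur in x
result <- update_rows_sketch(x, y, "a")
result
```

Both rows of `x` with key 1 receive the new value, and the `y` row with key 3 has no effect.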

DavisVaughan · 2020-11-05T16:09:32Z

R/rows.R


-  x[idx, names(y)] <- new_data
+  x[pos, names(y)] <- map2(x[pos, names(y)], y[idx, ], coalesce)


I imagine this should only coalesce over columns in x and y that are not key columns - but I am not sure this would have any practical difference

We would avoid extra work.

DavisVaughan · 2020-11-05T16:22:38Z

R/rows.R

-  rows_check_key_df(x, key, df_name = "x")
-  rows_check_key_df(y, key, df_name = "y")
+  rows_check_key_names_df(x, key, df_name = "x")
+  rows_check_key_unique_df(x, key, df_name = "x")


Seems useful to be able to delete multiple rows in x with the same key, i.e.

x <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
y <- data.frame(a = 1, b = 1)

# rows_delete(x, y, by = c("a", "b"))
x[3,]
#>   a b
#> 3 1 2

That just continues the theme of this PR which allows x to have duplicate keys
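
As a point of comparison, the result being asked for here matches what `anti_join()` already returns for the same inputs. The snippet below is only an illustration of the desired semantics, assuming dplyr is installed; it says nothing about how rows_delete() is or should be implemented.

```r
library(dplyr)

x <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
y <- data.frame(a = 1, b = 1)

# Drop every row of x whose (a, b) key appears in y, even when that key
# occurs more than once in x.
remaining <- anti_join(x, y, by = c("a", "b"))
remaining
```

Only the row with `a = 1, b = 2` is left, since both rows with the key `(1, 1)` are removed.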

krlmlr · 2020-11-05T16:50:18Z

I wonder if we could make enforcement of key constraints optional?

DavisVaughan · 2020-11-05T17:26:57Z

I think it makes sense to enforce that y always has unique keys (otherwise, how do you know which duplicate row from y to use to update x with?) (no changes needed - we do this in this PR)

I think it makes sense that keys in y that do not exist in x should always be silently ignored when updating/deleting/patching (no changes needed - we do this in this PR).

If implemented, I think this optional key constraint would mainly affect whether or not duplicate x keys are allowed (on the input or output). After further thinking, I don't think we need it, except possibly to constrain the output of rows_insert().

Input - I don't think it is the responsibility of rows_*() to check whether x has duplicate keys on the way in. That sounds like a job for the backend of x.

Output - Only rows_insert() has the potential to add a duplicate key to an x that currently has unique keys. An optional argument to just rows_insert() to prevent/allow inserting a duplicate key might be useful (i.e. I know I started with a unique x, and I want to guarantee that I don't add a duplicate row). I could even see an argument for ignoring keys in y that already exist in x when inserting. So the argument could be duplicates = c("error", "insert", "ignore")
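
To make the three proposed options concrete, here is a minimal base R sketch. Everything in it is an assumption for illustration: `insert_rows_sketch()` and its `duplicates` argument are hypothetical names taken from the proposal above, and it handles a single key column.

```r
# Hypothetical sketch of the proposed `duplicates` option; not part of dplyr.
insert_rows_sketch <- function(x, y, key,
                               duplicates = c("error", "insert", "ignore")) {
  duplicates <- match.arg(duplicates)

  clash <- y[[key]] %in% x[[key]]  # rows of y whose key already exists in x

  if (any(clash)) {
    if (duplicates == "error") {
      stop("Attempting to insert keys that already exist in `x`.")
    }
    if (duplicates == "ignore") {
      y <- y[!clash, , drop = FALSE]  # keep only the genuinely new keys
    }
    # duplicates == "insert": fall through and append everything
  }

  rbind(x, y)
}

x <- data.frame(a = c(1, 2), b = c("p", "q"))
y <- data.frame(a = c(2, 3), b = c("r", "s"))

ignored  <- insert_rows_sketch(x, y, "a", duplicates = "ignore")
inserted <- insert_rows_sketch(x, y, "a", duplicates = "insert")
```

With `"ignore"` only the row with key 3 is appended, with `"insert"` both rows are appended (leaving key 2 duplicated), and the default `"error"` stops because key 2 already exists in `x`.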

twest820 · 2021-05-12T20:58:20Z

Wanted to +1 this issue as I just hit it as well. It'd be nice to see the fix make it into a dplyr release.

In the meantime, left_join() %>% transmute() or equivalents provide a workaround.
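
A sketch of that workaround, using the data from the reprex at the top of this PR. The intermediate column name `y_new` is an arbitrary choice. One caveat: because of `coalesce()`, an NA in the lookup table leaves the existing value in place, which is not quite what rows_update() does with NA values in y.

```r
library(dplyr)

df1 <- tibble(x = c(1, 1, 2), y = c(2, 3, 5))
df2 <- tibble(x = 1, y = 4)

updated <- df1 %>%
  # Bring the replacement values alongside the existing ones.
  left_join(rename(df2, y_new = y), by = "x") %>%
  # Prefer the new value where the key matched, otherwise keep the old one.
  transmute(x, y = coalesce(y_new, y))

updated
```

Both rows with `x == 1` end up with `y == 4`, and the row with `x == 2` keeps `y == 5`.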

mgirlich · 2021-09-14T06:51:04Z

> I don't think it is the responsibility of rows_*() to check whether x has duplicate keys on the way in. That sounds like a job for the backend of x.

👍 I would even say that duplicate keys in x is quite typical for rows_update() and rows_patch(), e.g. when updating some value for a whole country.

> I could even see an argument for ignoring keys in y that already exist in x when inserting. So the argument could be duplicates = c("error", "insert", "ignore")

👍 This makes a lot of sense, especially when working with databases. And this also exists in databases (e.g. ON CONFLICT DO NOTHING)

> I think it makes sense that keys in y that do not exist in x should always be silently ignored when updating/deleting/patching

I think at least for updating and patching it would be nice to have an option for this. Or there should at least be a function to check whether all keys in y are also in x.

mgirlich · 2021-10-01T06:34:43Z

> Output - Only rows_insert() has the potential to add a duplicate key to an x that currently has unique keys. An optional argument to just rows_insert() to prevent/allow inserting a duplicate key might be useful (i.e. I know I started with a unique x, and I want to guarantee that I don't add a duplicate row). I could even see an argument for ignoring keys in y that already exist in x when inserting. So the argument could be duplicates = c("error", "insert", "ignore")

Note that rows_upsert() is basically just duplicates = "update" (and one could also imagine duplicates = "patch").
Still, I prefer the expressiveness of rows_upsert() over rows_insert(duplicates = "update").

-> Should duplicates = "update" work?
-> Add another verb rows_insert_missing() instead of duplicates = "ignore"?

mgirlich · 2022-01-31T10:39:31Z

A summary of the above and some additions from my side:

duplicates in `x`

The default behaviour should be to not check for them. There are feature requests for this. If actually required it should be a task for the backend.
Maybe add an argument to check that x is unique.

duplicates in `y`

rows_insert(): there are three ways to handle duplicates
- error
- insert the first row of each duplicated group, i.e. insert distinct(y, !!!syms(by), .keep_all = TRUE)
- insert everything regardless of duplicates
rows_update(), rows_patch() and rows_upsert(): error because it is unclear how to update x
rows_delete(): still knows which rows to delete.
- Maybe add an argument to check that y is unique.

matched `y`

rows_insert(): there are three ways to handle rows of y whose key already exists in x
- error
- ignore, i.e. do not insert
- insert
rows_update(), rows_patch(), rows_upsert(), rows_delete(): update/patch/delete rows

unmatched `y`

rows_insert(), rows_upsert(): insert
rows_update(), rows_patch(), rows_delete(): do nothing with these rows.
- Maybe add an argument to error on unmatched rows to be sure that all of y is applied.

`rows_insert(duplicates = "insert")`

This is the same as bind_rows(). From a user perspective it still makes sense to be able to insert arbitrary rows with rows_insert(). In particular this is nice for other backends because of in_place = TRUE.
An alternative to duplicates = "insert" could be by = character()

DavisVaughan · 2022-02-23T01:31:30Z

One thing we have learned when updating the join API is that it is important to restrict new "checking" arguments to cases that depend on the algorithm itself, and can't otherwise be checked outside that function. Otherwise we end up with an explosion of arguments with no clear bounds on where to stop (i.e. you can always "check" for more things). For example, a user could check that the keys of x are unique ahead of time, but they can't check ahead of time to see if a row in x will match multiple rows in y, which is why we are adding a multiple argument to the join functions.

The same ideas apply here, so I'll build off of @mgirlich's nice summary to outline my proposed plan of action:

1. Duplicates in x should always be allowed, everywhere
   - All 5 rows_*() functions can work with this
   - You can externally check that x has unique keys with enforce() ahead of time
2. Duplicates in y should always be allowed, if the rows_*() function is still well-defined
   - rows_update/patch/upsert() will error, because duplicates in y aren't well-defined (i.e. what do you update/patch with?)
   - rows_delete() will always work, because duplicates in y is the same as a unique y here
   - rows_insert() will always work, it will just insert multiple y rows into the new result
     - If this is an issue, externally check that y has unique keys with enforce() ahead of time
3. Matched row in y
   - rows_update/patch/upsert(): matching rows in y is the whole point, so these have no special options. Because of point 2, each row in x matches at most 1 row in y.
   - rows_delete(): again, matching rows in y is the point. May match >1 row without issue.
   - rows_insert(): matching rows in y default to an error, with options of matched = c("error", "insert", "ignore").
4. Unmatched row in y
   - rows_update/patch/delete(): having an unmatched y row is not necessarily a bad thing (you typically just won't use that y row), but you may be expecting every row in y to update some row in x, so here we'd add an argument to control this, with options of unmatched = c("ignore", "error").
   - rows_upsert() has a defined behavior when there is an unmatched row in y: it inserts it instead.
   - rows_insert() inserts it.

This plan means that checking for uniqueness in the keys of x and y remains something that is externally validated in some other way (which is very nice!).

In dplyr, validated by enforce()
In dbplyr, validated by primary keys on the database tables
In dm, I think you can also set primary keys ahead of time

Steps:

1. Let all rows functions allow duplicate x keys
   - With advice on how to check for unique x keys with enforce()
2. Let rows_delete() and rows_insert() allow duplicate y keys
   - With advice on how to check for unique y keys in rows_insert() with enforce()
   - Ensure the other 3 functions error with duplicate y keys
3. Add matched = c("error", "insert", "ignore") to rows_insert()
4. Add unmatched = c("ignore", "error") to rows_update(), rows_patch(), and rows_delete()
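
The `unmatched` option from the last step could behave roughly as in the base R sketch below. `handle_unmatched_sketch()` is a hypothetical helper that works on plain key vectors; the option names and the default are taken from the proposal above, not from any released dplyr API.

```r
# Hypothetical sketch of the proposed `unmatched` option; not dplyr's code.
# Returns the keys of y that would actually be applied to x.
handle_unmatched_sketch <- function(x_keys, y_keys,
                                    unmatched = c("ignore", "error")) {
  unmatched <- match.arg(unmatched)

  missing <- !(y_keys %in% x_keys)  # keys in y with no counterpart in x

  if (unmatched == "error" && any(missing)) {
    stop(
      "`y` contains keys that are not present in `x`: ",
      paste(y_keys[missing], collapse = ", ")
    )
  }

  y_keys[!missing]
}

applied <- handle_unmatched_sketch(c(1, 1, 2), c(1, 3))
applied
```

With the default `"ignore"` the unmatched key 3 is dropped quietly, while `unmatched = "error"` turns the same input into an error, which is the guarantee that all of y gets applied.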

krlmlr · 2022-02-23T06:26:38Z

TL;DR: Any extra checking arguments will cause moderate to severe pain downstream. Let's add checker methods instead.

For database (and later perhaps data.table) we have an in_place = FALSE argument. This feature gives a unique "preview" opportunity for database (and perhaps data.table) operations that would otherwise irrevocably change your data (or block other users if run in a transaction). I think consistency between in_place = TRUE and in_place = FALSE is very important.

As of now, I don't see rows_insert() and friends in dbplyr, they are implemented in dm for now. The implementation for in_place = TRUE hugely depends on the SQL dialect, for in_place = FALSE we can standardize a bit more. For databases, all extra checking usually means extra roundtrips to the database. These checks can be carried out by the database itself if primary keys or unique indices are defined. On the other hand, checking e.g. mismatches of update operations may be difficult.

Given these constraints, I propose to add a set of checking functions that check the sanity of a rows_*() operation, perhaps check_rows_*() ? We can add all the options that we want/need to those checkers. Behavior would be defined as follows:

- if check_rows_*() succeeds, the behavior of the corresponding rows_*() function is well-defined and gives the same results (ignoring row order) for all in_place settings.
- if check_rows_*() fails, the behavior is "best effort" -- anything from a failure to duplicate/missing rows is permissible
- rows_*() never does any checking by itself
- check_rows_*() never modifies the data
- check_rows_*() is backend-agnostic, the default methods work unmodified for all but the most exotic backends
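
One way to picture such a checker, as a rough base R sketch only: `check_rows_update_sketch()` is a hypothetical name, it looks at a single key column, and it reports problems instead of raising errors. The proposal above does not specify any of these details.

```r
# Hypothetical checker in the spirit of the proposed check_rows_*() family.
# It only reads the key columns and never modifies x or y.
check_rows_update_sketch <- function(x, y, key) {
  y_keys <- y[[key]]
  list(
    # keys that occur more than once in y (an update would be ambiguous)
    duplicate_y_keys = unique(y_keys[duplicated(y_keys)]),
    # keys in y that have no matching row in x (nothing to update)
    unmatched_y_keys = setdiff(y_keys, x[[key]])
  )
}

x <- data.frame(a = c(1, 1, 2), b = 1)
y <- data.frame(a = c(1, 3, 3), b = 2)

problems <- check_rows_update_sketch(x, y, "a")
str(problems)
```

A caller could run this before rows_update() and decide what to do with the reported keys, which keeps the rows_*() functions themselves free of checking arguments.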

allow duplicate rows in x to be updated

8281e64

romainfrancois requested review from krlmlr and DavisVaughan November 4, 2020 17:20

krlmlr reviewed Nov 5, 2020

romainfrancois and others added 3 commits November 5, 2020 08:58

Update R/rows.R

145be8e

Co-authored-by: Kirill Müller <krlmlr@users.noreply.github.com>

split rows_check_key_df() into rows_check_key_names_df() and rows_che…

3244a2d

…ck_key_unique_df()

allow multiple rows with same key in x in rows_*().

b1e2192

closes #5553

romainfrancois marked this pull request as ready for review November 5, 2020 14:12

factor our rows_check_subset(x, y)

7cf7299

DavisVaughan reviewed Nov 5, 2020

DavisVaughan mentioned this pull request Sep 16, 2021

rows_patch fails when y has key values that don't occur in x #5984

Closed

mgirlich mentioned this pull request Sep 17, 2021

Add rows_insert_missing() to insert rows not yet in db table cynkra/dm#608

Closed

mgirlich mentioned this pull request Oct 1, 2021

Add duplicates argument to rows_insert() cynkra/dm#637

Closed

mgirlich mentioned this pull request Jan 11, 2022

Add rows_*() verbs tidyverse/dbplyr#748

Merged

DavisVaughan mentioned this pull request Feb 25, 2022

Allow duplicate x keys in rows_*() (and sometimes allow duplicate y keys) #6199

Merged

DavisVaughan closed this in #6199 Mar 2, 2022

DavisVaughan mentioned this pull request Mar 2, 2022

Implement conflict and unmatched arguments for the rows_*() family #6203

Merged
