allow duplicate rows in x to be updated #5588

Closed
wants to merge 5 commits into from

romainfrancois
Member

I need to review the other functions, and understand better what they all do, but I think this is legit for #5553

library(dplyr)

df1 <- tibble(x = c(1, 1, 2), y = c(2, 3, 5))
df2 <- tibble(x = 1, y = 4)
rows_update(df1, df2, by = "x")
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     4
#> 2     1     4
#> 3     2     5

Created on 2020-11-04 by the reprex package (v0.3.0.9001)

Member

@krlmlr krlmlr left a comment

Thanks, I like this enhancement.

  • I think we shouldn't fail if rows are missing in x
  • Do we need to update documentation to match the new behavior?
  • For now we only have output tests, can we cover this case in the tests too?

R/rows.R Outdated
@@ -242,12 +246,12 @@ rows_check_key <- function(by, x, y) {
by
}

-rows_check_key_df <- function(df, by, df_name) {
+rows_check_key_df <- function(df, by, df_name, .check_unique = TRUE) {
Member

Is it worth splitting this function into two functions, rows_check_key_names_df() and rows_check_key_unique_df(), to avoid the new flag?

R/rows.R Outdated

bad <- which(is.na(idx))
if (has_length(bad)) {
if (!all(vec_in(y[key], x[key]))){
Member

@krlmlr krlmlr Nov 5, 2020

Do we even need this check?

Suggested change
if (!all(vec_in(y[key], x[key]))){
if (FALSE) {

If we introduce an asymmetry here between zero and nonzero matches, we contradict recycling rules in vctrs.

The current implementation requires exactly one match, with no recycling. That is safe but limiting.

The proposed implementation recycles one match to the number of entries in the target table, but fails if the target table has zero entries.

  • Avoiding the check entirely makes us more consistent with recycling rules
  • One use case of rows_update() might be imputation: we update NA or otherwise bad values in existing columns in x based on a lookup table in y. Why should that fail when some of the looked-up values are missing in x?
  • The SQL version UPDATE x SET ... FROM x, y WHERE x.key = y.key also silently ignores missing entries in x
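
A minimal base R sketch of the matching behaviour argued for above, assuming a single key column. `update_rows_sketch()` is a hypothetical helper written for illustration, not dplyr's implementation. Each row of x looks up its key in y, so one y row can update zero, one, or several x rows, and y keys that are absent from x are never used.

```r
# Hypothetical helper (not part of dplyr): overwrite the non-key columns of `x`
# with values from `y`, matching rows on a single key column.
update_rows_sketch <- function(x, y, key) {
  idx <- match(x[[key]], y[[key]])  # for each row of x: position of its key in y, or NA
  pos <- which(!is.na(idx))         # rows of x that found a match in y
  for (col in setdiff(names(y), key)) {
    x[[col]][pos] <- y[[col]][idx[pos]]
  }
  x
}

x <- data.frame(key = c(1, 1, 2), val = c(2, 3, 5))
y <- data.frame(key = c(1, 9), val = c(4, 7))  # key 9 does not occur in x

update_rows_sketch(x, y, "key")
#>   key val
#> 1   1   4
#> 2   1   4
#> 3   2   5
```

Because the lookup runs from x into y, no special case is needed for zero matches: the unmatched y row is simply never indexed.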

@romainfrancois romainfrancois marked this pull request as ready for review November 5, 2020 14:12
@romainfrancois
Member Author

Thanks @krlmlr for the initial review of the draft. Going a little further now.

@DavisVaughan
Member

Slight tangent - but while we are touching these functions, I think it would be clearer if we pulled this check out into its own function: rows_check_subset(x, y)

dplyr/R/rows.R

Lines 237 to 240 in ebb6448

bad <- setdiff(colnames(y), colnames(x))
if (has_length(bad)) {
abort("All columns in `y` must exist in `x`.")
}

It doesn't really have to do with the key
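
A possible shape for that helper, as a sketch only. The name rows_check_subset() comes from the suggestion above. The body is an assumption, and base R's stop() stands in for rlang's abort() so the snippet runs without extra packages.

```r
# Sketch of the suggested helper: fail if `y` has columns that `x` lacks.
# Assumption: stop() replaces rlang::abort() to keep the example dependency-free.
rows_check_subset <- function(x, y) {
  bad <- setdiff(colnames(y), colnames(x))
  if (length(bad) > 0) {
    stop("All columns in `y` must exist in `x`.", call. = FALSE)
  }
  invisible(NULL)
}

rows_check_subset(data.frame(a = 1, b = 2), data.frame(a = 1))         # passes silently
try(rows_check_subset(data.frame(a = 1), data.frame(a = 1, z = 2)))    # errors: `z` is not in `x`
```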

if (has_length(bad)) {
rows_check_key_unique_df(y, key, df_name = "y")

if (any(vctrs::vec_in(y[key], x[key]))) {
Member

I think being able to update multiple rows in x with the same key implies that we are okay with duplicate keys.

It might make sense to remove this check, allowing you to also insert a row with a duplicate key.

Member

Not sure. For insertion we might really care about uniqueness. On the other hand, should rows_insert() or the underlying storage be responsible for identifying duplicate key violations?

Member

I was also wondering if a strict = TRUE/FALSE argument would make sense for the data frame method. If strict = TRUE, then we can't add a duplicate key to x. This argument might make sense for other rows_*() functions too.

pos <- which(!is.na(idx))
idx <- idx[pos]

x[pos, names(y)] <- y[idx, ]
Member

Something like these might be useful tests to add

x <- data.frame(a = c(1, 2), b = 1)
y <- data.frame(a = 3, b = 2)

# `y` key that isn't in `x` = no changes
expect_identical(rows_update(x, y, "a"), x)

x <- data.frame(a = c(1, 2, 1), b = 1)
y <- data.frame(a = 1, b = 2)
expect <- data.frame(a = c(1, 2, 1), b = c(2, 1, 2))

# can update duplicate keys in `x`
expect_identical(rows_update(x, y, "a"), expect)


x[idx, names(y)] <- new_data
x[pos, names(y)] <- map2(x[pos, names(y)], y[idx, ], coalesce)
Member

I imagine this should only coalesce over columns in x and y that are not key columns - but I am not sure this would have any practical difference

Member

We would avoid extra work.
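
To make the point concrete, here is a base R sketch of a patch that only touches non-key columns, assuming a single key column. `patch_rows_sketch()` is a hypothetical name, and the NA fill below stands in for the map2(..., coalesce) call in the diff.

```r
# Hypothetical helper (not part of dplyr): fill NA values in `x` from `y`,
# matching on one key column and skipping the key column itself.
patch_rows_sketch <- function(x, y, key) {
  idx <- match(x[[key]], y[[key]])
  pos <- which(!is.na(idx))
  # Matched rows already agree on the key, so coalescing it would be wasted work.
  for (col in setdiff(names(y), key)) {
    old <- x[[col]][pos]
    new <- y[[col]][idx[pos]]
    missing <- is.na(old)
    old[missing] <- new[missing]  # keep existing values, fill only the gaps
    x[[col]][pos] <- old
  }
  x
}

x <- data.frame(a = c(1, 2, 1), b = c(NA, 1, 3))
y <- data.frame(a = 1, b = 2)
patch_rows_sketch(x, y, "a")$b
#> [1] 2 1 3
```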

rows_check_key_df(x, key, df_name = "x")
rows_check_key_df(y, key, df_name = "y")
rows_check_key_names_df(x, key, df_name = "x")
rows_check_key_unique_df(x, key, df_name = "x")
Member

Seems useful to be able to delete multiple rows in x with the same key, i.e.

x <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
y <- data.frame(a = 1, b = 1)

# rows_delete(x, y, by = c("a", "b"))
x[3,]
#>   a b
#> 3 1 2

This continues the theme of this PR, which allows x to have duplicate keys.
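
A base R sketch of a delete that tolerates duplicate keys in x. `delete_rows_sketch()` is a hypothetical helper, and pasting the key columns into one string per row is a simplification chosen for brevity, not how dplyr compares keys.

```r
# Hypothetical helper (not part of dplyr): drop every row of `x` whose key
# combination occurs in `y`, however many times it occurs in `x`.
delete_rows_sketch <- function(x, y, by) {
  # Assumption: combining key columns with paste() is good enough for this
  # illustration; a real implementation would compare the columns directly.
  key_x <- do.call(paste, c(unname(as.list(x[by])), sep = "\r"))
  key_y <- do.call(paste, c(unname(as.list(y[by])), sep = "\r"))
  x[!key_x %in% key_y, , drop = FALSE]
}

x <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
y <- data.frame(a = 1, b = 1)
delete_rows_sketch(x, y, by = c("a", "b"))
#>   a b
#> 3 1 2
```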

@krlmlr
Member

krlmlr commented Nov 5, 2020

I wonder if we could make enforcement of key constraints optional?

@DavisVaughan
Member

DavisVaughan commented Nov 5, 2020

I think it makes sense to enforce that y always has unique keys (otherwise, how do you know which duplicate row from y to use to update x with?) (no changes needed - we do this in this PR)

I think it makes sense that keys in y that do not exist in x should always be silently ignored when updating/deleting/patching (no changes needed - we do this in this PR).


If implemented, I think this optional key constraint would mainly affect whether or not duplicate x keys are allowed (on the input or output). After further thinking, I don't think we need it, except possibly to constrain the output of rows_insert().

Input - I don't think it is the responsibility of rows_*() to check whether x has duplicate keys on the way in. That sounds like a job for the backend of x.

Output - Only rows_insert() has the potential to add a duplicate key to an x that currently has unique keys. An optional argument to just rows_insert() to prevent/allow inserting a duplicate key might be useful (i.e. I know I started with a unique x, and I want to guarantee that I don't add a duplicate row). I could even see an argument for ignoring keys in y that already exist in x when inserting. So the argument could be duplicates = c("error", "insert", "ignore")
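
The three options could behave roughly as follows. This is a base R sketch, assuming a single key column and identical columns in x and y. `insert_rows_sketch()` is a hypothetical name, and the duplicates argument is the proposal from this comment, not an existing dplyr argument.

```r
# Hypothetical helper (not part of dplyr): append `y` to `x`, with a choice of
# what to do when a key in `y` already exists in `x`.
insert_rows_sketch <- function(x, y, key,
                               duplicates = c("error", "insert", "ignore")) {
  duplicates <- match.arg(duplicates)
  clash <- y[[key]] %in% x[[key]]
  if (any(clash)) {
    if (duplicates == "error") {
      stop("Attempting to insert keys that already exist in `x`.", call. = FALSE)
    }
    if (duplicates == "ignore") {
      y <- y[!clash, , drop = FALSE]  # silently drop the clashing rows
    }
    # duplicates == "insert": keep `y` as-is, so the result gains a duplicate key
  }
  rbind(x, y)
}

x <- data.frame(a = c(1, 2), b = c("p", "q"))
y <- data.frame(a = c(2, 3), b = c("r", "s"))

insert_rows_sketch(x, y, "a", duplicates = "ignore")$a
#> [1] 1 2 3
insert_rows_sketch(x, y, "a", duplicates = "insert")$a
#> [1] 1 2 2 3
```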

@twest820

Wanted to +1 this issue as I just hit it as well. It'd be nice to see the fix make it into a dplyr release.

In the meantime, left_join() %>% transmute() or equivalents provide a workaround.
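
One way that workaround might look, using the data from the reprex at the top of the thread. The suffix choice and the coalesce() step are assumptions about what "or equivalents" means here, not code taken from the comment.

```r
library(dplyr)

df1 <- tibble(x = c(1, 1, 2), y = c(2, 3, 5))
df2 <- tibble(x = 1, y = 4)

# Bring the replacement values in as `y.new`, then prefer them where present.
updated <- df1 %>%
  left_join(df2, by = "x", suffix = c("", ".new")) %>%
  transmute(x, y = coalesce(y.new, y))

updated
```

Under these assumptions the y column comes out as 4, 4, 5, the same result the reprex shows for rows_update() with this PR.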

@mgirlich

I don't think it is the responsibility of rows_*() to check whether x has duplicate keys on the way in. That sounds like a job for the backend of x.

👍 I would even say that duplicate keys in x is quite typical for rows_update() and rows_patch(), e.g. when updating some value for a whole country.

I could even see an argument for ignoring keys in y that already exist in x when inserting. So the argument could be duplicates = c("error", "insert", "ignore")

👍 This makes a lot of sense, especially when working with databases. And this also exists in databases (e.g. ON CONFLICT DO NOTHING)

I think it makes sense that keys in y that do not exist in x should always be silently ignored when updating/deleting/patching

I think at least for updating and patching it would be nice to have an option for this. Or there should at least be a function to check whether all keys in y are also in x.

@mgirlich

mgirlich commented Oct 1, 2021

Output - Only rows_insert() has the potential to add a duplicate key to an x that currently has unique keys. An optional argument to just rows_insert() to prevent/allow inserting a duplicate key might be useful (i.e. I know I started with a unique x, and I want to guarantee that I don't add a duplicate row). I could even see an argument for ignoring keys in y that already exist in x when inserting. So the argument could be duplicates = c("error", "insert", "ignore")

Note that rows_upsert() is basically just duplicates = "update" (and one could also imagine duplicates = "patch").
Still, I prefer the expressiveness of rows_upsert() over rows_insert(duplicates = "update").

-> Should duplicates = "update" work?
-> Add another verb rows_insert_missing() instead of duplicates = "ignore"?

@mgirlich

A summary of the above and some additions from my side:

duplicates in x

  • The default behaviour should be to not check for them. There are feature requests for this. If actually required it should be a task for the backend.
  • Maybe add an argument to check that x is unique.

duplicates in y

  • rows_insert(): there are three ways to handle duplicates
    • error
    • insert the first row of each duplicated group, i.e. insert distinct(y, !!!syms(by), .keep_all = TRUE)
    • insert everything regardless of duplicates
  • rows_update(), rows_patch() and rows_upsert(): error because it is unclear how to update x
  • rows_delete(): still knows which rows to delete.
    • Maybe add an argument to check that y is unique.

matched y

  • rows_insert(): there are three ways to handle duplicates
    • error
    • ignore, i.e. do not insert
    • insert
  • rows_update(), rows_patch(), rows_upsert(), rows_delete(): update/patch/delete rows

unmatched y

  • rows_insert(), rows_upsert(): insert
  • rows_update(), rows_patch(), rows_delete(): do nothing with these rows.
    • Maybe add an argument to error on unmatched rows to be sure that all of y is applied.

rows_insert(duplicates = "insert")

  • is the same as bind_rows(). From a user perspective it still makes sense to have a possibility to insert arbitrary rows with rows_insert(). In particular this is nice for other backends due to in_place = TRUE.
  • An alternative to duplicates = "insert" could be by = character()

@DavisVaughan
Member

DavisVaughan commented Feb 23, 2022

One thing we have learned when updating the join API is that it is important to restrict new "checking" arguments to cases that depend on the algorithm itself, and can't otherwise be checked outside that function. Otherwise we end up with an explosion of arguments with no clear bounds on where to stop (i.e. you can always "check" for more things). For example, a user could check that the keys of x are unique ahead of time, but they can't check ahead of time to see if a row in x will match multiple rows in y, which is why we are adding a multiple argument to the join functions.

The same ideas apply here, so I'll build off of @mgirlich's nice summary to outline my proposed plan of action:

  1. Duplicates in x should always be allowed, everywhere

    • All 5 rows_*() functions can work with this
    • You can externally check that x has unique keys with enforce() ahead of time
  2. Duplicates in y should always be allowed, if the rows_*() function is still well-defined

    • rows_update/patch/upsert() will error, because duplicates in y aren't well-defined (i.e. what do you update/patch with?)
    • rows_delete() will always work, because duplicates in y is the same as a unique y here
    • rows_insert() will always work, it will just insert multiple y rows into the new result
      • If this is an issue, externally check that y has unique keys with enforce() ahead of time
  3. Matched row in y

    • rows_update/patch/upsert() matching rows in y is the whole point, so these have no special options. Because of point 2, these match at most 1 row.
    • rows_delete(), again, matching rows in y is the point. May match >1 row without issue.
    • rows_insert(), matching rows in y default to an error, with options of matched = c("error", "insert", "ignore").
  4. Unmatched row in y

    • rows_update/patch/delete() having an unmatched y row is not necessarily a bad thing (you typically just won't use that y row), but you may be expecting every row in y to update some row in x, so here we'd add an argument to control this, with options of unmatched = c("ignore", "error").
    • rows_upsert() has a defined behavior when there is an unmatched row in y, it inserts it instead.
    • rows_insert() inserts the unmatched row.

This plan means that checking for uniqueness in the keys of x and y remains something that is externally validated in some other way (which is very nice!).

  • In dplyr, validated by enforce()
  • In dbplyr, validated by primary keys on the database tables
  • In dm, I think you can also set primary keys ahead of time

Steps:

  • Let all rows functions allow duplicate x keys
    • With advice on how to check for unique x keys with enforce()
  • Let rows_delete() and rows_insert() allow duplicate y keys
    • With advice on how to check for unique y keys in rows_insert() with enforce()
    • Ensure the other 3 functions error with duplicate y keys
  • Add matched = c("error", "insert", "ignore") to rows_insert()
  • Add unmatched = c("ignore", "error") to rows_update(), rows_patch(), and rows_delete()
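
For the update case, the steps above could be sketched in base R as follows, assuming a single key column. `update_rows_checked()` is a hypothetical name. The unmatched argument and its default follow the proposal in this comment and may differ from what a released dplyr version provides.

```r
# Hypothetical helper (not part of dplyr): update with the two rules from the plan.
# Duplicate keys in `y` are always an error; unmatched `y` keys are configurable.
update_rows_checked <- function(x, y, key, unmatched = c("ignore", "error")) {
  unmatched <- match.arg(unmatched)
  if (anyDuplicated(y[[key]]) > 0) {
    stop("`y` must have unique keys.", call. = FALSE)
  }
  if (unmatched == "error" && !all(y[[key]] %in% x[[key]])) {
    stop("Some keys in `y` do not exist in `x`.", call. = FALSE)
  }
  # Duplicate keys in `x` are allowed: every matching row is updated.
  idx <- match(x[[key]], y[[key]])
  pos <- which(!is.na(idx))
  for (col in setdiff(names(y), key)) {
    x[[col]][pos] <- y[[col]][idx[pos]]
  }
  x
}

x <- data.frame(a = c(1, 2, 1), b = c(1, 1, 1))
y <- data.frame(a = c(1, 3), b = c(2, 9))

update_rows_checked(x, y, "a")$b                        # key 3 is ignored
#> [1] 2 1 2
try(update_rows_checked(x, y, "a", unmatched = "error"))  # key 3 triggers an error
```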

@krlmlr
Member

krlmlr commented Feb 23, 2022

TL;DR: Any extra checking arguments will cause moderate to severe pain downstream. Let's add checker methods instead.


For database (and later perhaps data.table) we have an in_place = FALSE argument. This feature gives a unique "preview" opportunity for database (and perhaps data.table) operations that would otherwise irrevocably change your data (or block other users if run in a transaction). I think consistency between in_place = TRUE and in_place = FALSE is very important.

As of now, I don't see rows_insert() and friends in dbplyr, they are implemented in dm for now. The implementation for in_place = TRUE hugely depends on the SQL dialect, for in_place = FALSE we can standardize a bit more. For databases, all extra checking usually means extra roundtrips to the database. These checks can be carried out by the database itself if primary keys or unique indices are defined. On the other hand, checking e.g. mismatches of update operations may be difficult.

Given these constraints, I propose to add a set of checking functions that check the sanity of a rows_*() operation, perhaps check_rows_*() ? We can add all the options that we want/need to those checkers. Behavior would be defined as follows:

  • if check_rows_*() succeeds, the behavior of the corresponding rows_*() function is well-defined and gives the same results (ignoring row order) for all in_place settings.
  • if check_rows_*() fails, the behavior is "best effort" -- anything from a failure to duplicate/missing rows is permissible
  • rows_*() never does any checking by itself
  • check_rows_*() never modifies the data
  • check_rows_*() is backend-agnostic, the default methods work unmodified for all but the most exotic backends
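
As a rough illustration of such a checker for the update case: the sketch below reports problems without touching the data. `check_rows_update_sketch()` and its return value are assumptions made for this example, and it handles a single key column on in-memory data frames only.

```r
# Hypothetical checker (not an existing dplyr or dm function): inspect `x` and
# `y` and return a character vector of problems. Empty means the update is
# well-defined. Neither input is modified.
check_rows_update_sketch <- function(x, y, key) {
  problems <- character()
  if (anyDuplicated(y[[key]]) > 0) {
    problems <- c(problems, "duplicate keys in `y`")
  }
  if (!all(y[[key]] %in% x[[key]])) {
    problems <- c(problems, "keys in `y` that do not exist in `x`")
  }
  problems
}

x <- data.frame(a = c(1, 1, 2), b = c(2, 3, 5))

check_rows_update_sketch(x, data.frame(a = 1, b = 4), "a")
#> character(0)
check_rows_update_sketch(x, data.frame(a = c(3, 3), b = c(4, 5)), "a")
#> [1] "duplicate keys in `y`"                "keys in `y` that do not exist in `x`"
```

Returning the problems instead of raising an error leaves the caller free to decide whether to stop or to proceed on a best-effort basis.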


5 participants