Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset patching tools #183

Closed
MarcusWalz opened this issue Apr 19, 2016 · 16 comments
Closed

Dataset patching tools #183

MarcusWalz opened this issue Apr 19, 2016 · 16 comments

Comments

@MarcusWalz
Copy link

@MarcusWalz MarcusWalz commented Apr 19, 2016

Formerly filed in dpylr repo.

I'm doing an analysis that requires a lot of manual corrections to the original dataset. I keep tables of changes that mirror the original (see example in code) and then I use the code below to impute the original values. Probably niche, but who am I to decide. Here's my code:

library(dplyr)
library(lazyeval)

update_with = function(data, update, by) {

  # columns to be updated
  # (data intersect cols) -- by

  orig_order = colnames(data)
  update_cols =
    setdiff(intersect(colnames(data), colnames(update)), by)

  # left join update
  data = data %>% left_join(update, by=by)

  # transmutate one by one
  for(col in update_cols) {
    col.x = paste(col, "x", sep=".")
    col.y = paste(col, "y", sep=".")

    myifelse = function(x, y) {
      x[!is.na(y)] <- y[!is.na(y)]
      x
    }
    update = interp( ~myifelse(x, y)
                   , x=as.name(col.x)
                   , y=as.name(col.y)
                   )

    data = data %>%
      mutate_( .dots = setNames(list(update), col) ) %>%
      select( -one_of(c(col.x,col.y)) )
  }

  # deletes extra columns too
  data[,orig_order]
}



a = data.frame(x = 1:100,y = 5, z = 1:50)
b = data.frame(x = 1:10, y = 0)

# sets y to 0 when x  is in range 1:10, otherwise a remains the same 
a %>% update_with(b, by="x")

After using this method for awhile I keep my patches in this format, which assigns value to the cell described by Row and Column.

Row Column Value Data to help fill in value
1 x 123 hint: it's as easy as...
2 y xyz link to pathology report

and I remove the rightmost column use spread on the column Column and feed the subsequent dataframe into update_with. This probably has some unintended side effects when mixing integers and factors together in the value column. This has been surprisingly robust and I've been able to keep a single patch file for each table.

thanks,
Marcus

@hadley
Copy link
Member

@hadley hadley commented May 16, 2016

I like the idea, but that patch form feels a bit fragile to me - if the row order changes you'll silently patch the wrong values. I think it would be better to supply the value of some keys:

mtcars2 <- tibble::rownames_to_column(mtcars, "model")

patches <- frame_data(
  ~ model, ~ column, ~value, ~comment,
  "Mazda RX4", "mpg", 20, "Bad copy & paste"
)

mtcars %>% patch(patches)

This would require the yet unimplemented key-checking functions from dplyr to make sure that there was a unique match for each key.

@jennybc have you thought about this at all?

@jennybc
Copy link
Member

@jennybc jennybc commented May 16, 2016

I've only thought about it up to ... agreeing that it would be really useful to have better tools for patching!

I have really awkward workflows for this and it seems nice to think of it as a novelty join, which is how I interpret your comment.

I'd been planning to delve into the new dplyr::case_when() the next time this came up. But re-skimming that issue (tidyverse/dplyr#631) and reading the help file makes me think case_when() doesn't really solve it? I just recalled that patching did come up explicitly over there.

Why would this go in tidyr vs dplyr?

@hadley
Copy link
Member

@hadley hadley commented May 17, 2016

Hmmm, interesting - that patch() feels more like a "mutate_when", which I'm not currently super big on. This patch() feels more like the sort of thing you'd want to create (either by hand or with an add-in) when you discover some bad values in your input.

@jennybc
Copy link
Member

@jennybc jennybc commented May 19, 2016

What I meant by "novelty join:" Like left join, but instead of getting mpg.x, mpg.y, wt.x, and wt.y, we get just mpg and wt in the result, with the values in patches taking precedence. The proposals are pretty different in terms of how to patch more than one variable.

library(dplyr)
(mtcars2 <- tibble::rownames_to_column(mtcars, "model") %>%
  head(3) %>% select(model, mpg, cyl, hp, drat, wt))
#>           model  mpg cyl  hp drat    wt
#> 1     Mazda RX4 21.0   6 110 3.90 2.620
#> 2 Mazda RX4 Wag 21.0   6 110 3.90 2.875
#> 3    Datsun 710 22.8   4  93 3.85 2.320

patches <- frame_data(
      ~ model, ~ mpg, ~ wt,
  "Mazda RX4",   500,  200
)

## fiction!
mtcars2 %>%
  patch(patches, by = "model")
#>           model   mpg cyl  hp drat    wt
#> 1     Mazda RX4 500.0   6 110 3.90 200.0
#> 2 Mazda RX4 Wag  21.0   6 110 3.90 2.875
#> 3    Datsun 710  22.8   4  93 3.85 2.320

BTW, assuming I understand what you mean by "mutate_when", I think one of those just came up on SO.

@hadley
Copy link
Member

@hadley hadley commented May 19, 2016

@jennybc yes, that's what I was picturing too - except I'm not sure if you want to patch individual variables or whole rows. In your scenario I guess you'd use NA in patches to indicate that you didn't want to replace the value? I can imagining patching either cells or rows to be useful in different scenarios. It would be easy to convert between the two forms with gather/spread, so it wouldn't matter too much which was primary.

@jennybc
Copy link
Member

@jennybc jennybc commented Jun 1, 2016

Recording a link to a related Twitter discussion. I think it's another example of the type of problem solved by the patching discussed here: "join and update columns, instead of duplicating them". It's an example where it's natural to update multiple variables at once.

In #Rstats, how to merge DFs (X,Y) with columns from Y replacing (overwriting) those having same name in X?
X has 10 cols A-J. Y has 3 cols F,G,H. Want cols from Y to replace cols in X that have same name; keep other cols from X.
Standard merge will give me new columns F.y, G.y, and H.y.

@MarcusWalz
Copy link
Author

@MarcusWalz MarcusWalz commented Jul 20, 2016

Hmm, I'm having to patch a lot in my current project that's a combination of an ETL process and manual data entry for variables my ETL pipeline doesn't capture. It's patches on top of patches on top of patches.

Here's my latest and greatest patch function. patch_fun controls the patching behavior. It's defaulted to the function coalesce. A custom function can be thrown in for other use cases, life if, for example, you only wish to patch NA's. Columns to patch can be specified in ..., otherwise the patch patches all columns common to data and patch_data minus the columns listed in by. by can not be null. The correctness of the patch is checked in a few places. A patch must be in injection onto the data that's being patched (i.e. a one-to-one relationship). I typically have to group and summaries to get my real world patches into a one-to-one relationship.

@jennybc example above works as specified.

require(dplyr)
require(lazyeval)

patch <- function(data, patch_data, ..., by = NULL, na_only=FALSE, patch_fun=coalesce) {
  patch_cols <- unname(dplyr::select_vars(colnames(data), ...))
  if(length(patch_cols) == 0) patch_cols <- NULL
  patch_(data, patch_data, patch_cols, by, na_only, patch_fun)
}

patch_ <- function(data, patch_data, patch_cols = NULL, by = NULL, na_only=FALSE, patch_fun=coalesce) {
  if(is.null(by)) {
    error("`by` must be specified")
  }

  # Find common cols
  common_cols  <- intersect(colnames(data), colnames(patch_data))

  if(is.null(patch_cols)) {
    patch_cols <- common_cols %>% setdiff(by)
    # TODO fire off a warning 
  }

  # No missing columns 
  missing_cols <- setdiff(c(patch_cols, by), common_cols) 
  if( length(missing_cols) > 0 ) {
    stop("Can not apply patch, columns ", paste(missing_cols, sep=", "),
         "must be in both the original data and the patch")
  }

  # No columns being patched, warn and return data as-is
  if( length(patch_cols) == 0 ) {
      warning("No rows in y to patch onto x")
      return(data)
  }

  # Can not join by a patching column
  if( length(intersect(by, patch_cols)) != 0 ) {
    stop("Cannot patch a joining column")
  }

  # Builds each term of transmute expressions for colname of x
  build_expr <- function(colname) {
    if(colname %in% patch_cols) {
      # coalesce the two columns together x = coalesce(y, x)
      interp(~patch_fun(colname_y, colname), 
             colname   = as.name(paste(colname, "x", sep=".")),
             colname_y = as.name(paste(colname, "y",sep="."))
      )
    } else {
      # identity x = x 
      interp( ~colname, colname=as.name(colname) ) 
    }
  }

  expr <- Map(build_expr, colnames(data)) 

  # number of rows produced by join should be unchanged,
  # keep only needed columns and use only distinct rows

  joined <- left_join(
    data,
    patch_data %>% select(one_of(union(patch_cols, by))) %>% distinct,
    by=by
  )

  if( nrow(data) != nrow(joined)) {
    stop("patch cannot be many-to-one with respect to data")
  }

  joined %>% transmute_(.dots=expr)
}
@hadley
Copy link
Member

@hadley hadley commented Feb 21, 2017

Just a note to say that upsert() feels like the right name for one of these behaviours.

@alistaire47
Copy link

@alistaire47 alistaire47 commented Apr 20, 2017

I've found myself using this sort of update_join when aggregating data for a package (coming to GitHub once a few bugs are squashed). It's largely just a left or full join and then coalesceing the duplicated columns, but a related issue arises:

I have redundant key columns, but they're incomplete. Trying to update key columns with NAs before attempting to join everything requires dropping cases with NAs from one of the tables beforehand to avoid a cross join of NAs. That's fine, but suggests the possibility of updating keys: if one is NA, use the other to join and update, effectively combining keys by | instead of &.

This might fit in with join_by, maybe as a helper function working like nesting does within expand or complete, e.g. what would currently take something like

library(tidyverse)
set.seed(47)

df1 <- data_frame(key1a = c(1, 1, NA, NA, 3, 3),
                  key1b = c('a', 'a', 'b', 'b', NA, NA),
                  key2 = c(1:2, 1:2, 1:2),
                  var1 = rnorm(6))

df2 <- data_frame(key1a = c(1, 1, 2, 2, 3, 3),
                  key1b = c(NA, NA, 'b', 'b', 'c', 'c'),
                  key2 = c(1:2, 1:2, 1:2),
                  var2 = runif(6))

df1 %>% drop_na(key1a) %>% 
    full_join(df2, by = c('key1a', 'key2')) %>% 
    mutate(key1b = coalesce(key1b.x, key1b.y)) %>% 
    select(-var1, -contains('.')) %>% 
    left_join(df1, by = c('key1a', 'key2')) %>% 
    mutate(key1b = key1b.x) %>% 
    left_join(df1, by = c('key1b', 'key2')) %>% 
    mutate(key1a = key1a.x, 
           var1 = coalesce(var1.x, var1.y)) %>% 
    select(!!!rlang::syms(union(names(df1), names(df2))))

#> # A tibble: 6 × 5
#>   key1a key1b  key2       var1       var2
#>   <dbl> <chr> <int>      <dbl>      <dbl>
#> 1     1     a     1  1.9946963 0.16219364
#> 2     1     a     2  0.7111425 0.59930702
#> 3     3     c     1  0.1087755 0.40050280
#> 4     3     c     2 -1.0857375 0.03094497
#> 5     2     b     1  0.1854053 0.50603611
#> 6     2     b     2 -0.2817650 0.90197352

or could be written with update joins as something like

df1 <- df1 %>% 
    update_join(df2, by = c('key1b', 'key2')) %>% 
    update_join(df2, by = c('key1a', 'key2'))
df2 <- df2 %>% 
    update_join(df1, by = c('key1b', 'key2')) %>% 
    update_join(df1, by = c('key1a', 'key2'))

left_join(df1, df2, by = c('key1a', 'key1b', 'key2'))

could just be

left_join(df1, df2 by = join_by(updating(key1a, key1b), key2))

In essence then, it's just a series of crossing update joins on key columns before the final join, and thus shouldn't take much more code.

@hadley
Copy link
Member

@hadley hadley commented Nov 16, 2017

Summarising the discussion, I think there are at least patching cases described here:

  • Patch individual values described by a location (key variables + name of variable to fix), the new value, and a comment.

  • Combine two data frames where y contains some of the same variables as x. Match on key variables, then either alway takes the values from y, or only take them if x is NA.

  • Combine two data frames with identical variables. Replace matching rows in x with y, and also add new rows. This feels most like upsert().

I wonder if these could be called patch_val(), patch_col() and patch_row().

I'm not sure if only replacing missing values is a core feature or an incidental detail.

@hadley hadley changed the title Dataset Patching tools. Dataset patching tools Nov 16, 2017
@billdenney
Copy link
Contributor

@billdenney billdenney commented Dec 30, 2017

I have a use case that is your second patching case ("Combine two data frames where y contains some of the same variables as x. Match on key variables, then either always takes the values from y, or only take them if x is NA.") and is a variant on that. (From the conversation initiated in the dplyr issue just linked.)

What you described as patch_col() further generalizes as, "If values are missing in column_1, replace them with values in column_2." And more so, allow an arbitrary number of columns and column groups. My use case for this generalization is that I do meta-analyses, and when I do them, I often have various columns representing the potential number of measurements in an observation. usually I work with clinical studies, so the N is number of subjects in a clinical trial, the various N I may have available include from most to least preferred:

  • number of subjects contributing to an observed mean
  • number of subjects in the current treatment arm of the study
  • number of subjects in the study divided by number of treatment arms

What do you think of the below as an implementation:

#' Patch missing values in a set of columns to fill in the first column.
#'
#' @param data A data frame.
#' @param ... Column names (as used by
#'   \code{\link[tidyselect]{vars_select}}).  These cannot be paired
#'   with the \code{suffix} argument.
#' @param remove If \code{TRUE}, remove all columns but the first from
#'   the output data frame.
#' @param na Values which should be replaced.
#' @param suffix A character vector of column name suffixes to combine.
#'   (Useful if a \code{merge} or \code{join} generated the data frame
#'   and multiple pairs of columns share the suffix).
#' @return The data frame with values merged into the first requested
#'   column.
#' @importFrom tidyselect vars_select
#' @export
patch_col <- function(data, ..., remove=TRUE, na=NA, suffix=c()) {
  vars <- tidyselect::vars_select(names(data), ..., .strict=TRUE)
  if (length(vars) > 0 & length(suffix) > 0) {
    stop("Cannot use ... and suffix at the same time.")
  }
  if (length(suffix) > 0) {
    patch_col_suffix(data=data, remove=remove, na=na, suffix=suffix)
  } else {
    patch_col_set(data=data, remove=remove, na=na, vars=vars)
  }
}

patch_col_set <- function(data, vars=c(), remove=TRUE, na=NA, newname=vars[1]) {
  if (length(vars) < 2) {
    stop("At least two columns must be provided to merge")
  }
  # Apply appropriate coercion tests here; for now, errors occur on
  # attempted patching if not possible.
  missing_val <- data[[vars[[1]]]] %in% na
  data[[newname]] <- data[[vars[[1]]]]
  idx <- 2
  while (any(missing_val) & idx <= length(vars)) {
    data[[newname]][missing_val] <- data[[vars[[idx]]]][missing_val]
    idx <- idx + 1
    missing_val <- data[[newname]] %in% na
  }
  if (remove) {
    data[,setdiff(names(data), setdiff(vars, newname)), drop=FALSE]
  } else {
    data
  }
}

#' @importFrom purrr reduce
patch_col_suffix <- function(data, remove=TRUE, na=NA, suffix=c()) {
  trim_suffix <- function(current_suffix, cols) {
    mask_match <- endsWith(cols, current_suffix)
    if (any(mask_match)) {
      substring(cols[mask_match], 1, nchar(cols[mask_match]) - nchar(current_suffix))
    } else {
      character(0)
    }
  }
  if (length(suffix) < 2) {
    stop("Must have at least two suffixes to combine.")
  }
  trimmed_columns <-
    lapply(suffix,
           trim_suffix,
           cols=names(data))
  duplicated_columns <- purrr::reduce(.x=trimmed_columns, .f=intersect)
  if (length(duplicated_columns)) {
    for (i in seq_along(duplicated_columns)) {
      data <- patch_col_set(data=data,
                            vars=paste0(duplicated_columns[i],
                                        suffix),
                            remove=remove,
                            na=na,
                            newname=duplicated_columns[i])
    }
    data
  } else {
    stop("No duplicated columns with the provided suffixes")
  }
}

library(dplyr)
# Without patching
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6), 
  by = "A")
#>   A B.x C.x B.y C.y
#> 1 1   2   3  NA  NA
#> 2 4  NA  NA   5   6

# With patching by name (values go into 'B.x')
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6), 
  by = "A") %>% patch_col(B.x, B.y)
#>   A B.x C.x C.y
#> 1 1   2   3  NA
#> 2 4   5  NA   6

# With patching by suffix (values go into 'B' and 'C')
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6), 
  by = "A") %>% patch_col(suffix = c(".x", ".y"))
#>   A B C
#> 1 1 2 3
#> 2 4 5 6
@billdenney
Copy link
Contributor

@billdenney billdenney commented Dec 30, 2017

While I was at it, how about this for patch_val. An extension to this would have the arguments either be a named list (which would look for equality; current version) or a formula (which would be evaluated and coerced to a logical in the environment of the data.frame):

#' Update one or more values in a data frame
#'
#' @param data A data frame
#' @param ... Named arguments to match.  The value of the argument is
#'   compared against all values in \code{data[[nm]]} with \code{%in%},
#'   so argument values may be a scalar or a vector.
#' @param .new_val A named list of new values to put in the row(s) found.
#' @return \code{data} with updated values.
#' @export
patch_val <- function(data, ..., .new_val) {
  args <- list(...)
  if (length(args) == 0 & nrow(data) != 1) {
    stop("Must give at least one row to match unless there is only one row of data.")
  } else if (length(args) && (is.null(names(args)) | any(names(args) %in% ""))) {
    stop("All arguments must be named.")
  } else if (is.null(names(.new_val)) || any(names(.new_val %in% ""))) {
    stop(".new_val must be a named list.")
  }
  mask_match <- rep(TRUE, nrow(data))
  for (nm in names(args)) {
    mask_match <- mask_match & data[[nm]] %in% args[[nm]]
  }
  if (any(mask_match)) {
    for (nm in names(.new_val)) {
      data[[nm]][mask_match] <- .new_val[[nm]]
    }
  }
  data
}
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data.frame(A = 1:5, B = 6:10, C = 11:15, D = c(LETTERS[1:4], NA), stringsAsFactors = FALSE)
#>   A  B  C    D
#> 1 1  6 11    A
#> 2 2  7 12    B
#> 3 3  8 13    C
#> 4 4  9 14    D
#> 5 5 10 15 <NA>

data.frame(A = 1:5, B = 6:10, C = 11:15, D = c(LETTERS[1:4], NA), stringsAsFactors = FALSE) %>% 
  patch_val(A = 1, .new_val = list(D = "Q"))
#>   A  B  C    D
#> 1 1  6 11    Q
#> 2 2  7 12    B
#> 3 3  8 13    C
#> 4 4  9 14    D
#> 5 5 10 15 <NA>
@ctmann
Copy link

@ctmann ctmann commented Jul 28, 2018

I would settle for adding a replacement option to left_join (replace.with.y=TRUE)

@lionel-
Copy link
Member

@lionel- lionel- commented Oct 29, 2019

There is an implementation in tidyverse/dplyr#4595 (comment)

@moodymudskipper
Copy link

@moodymudskipper moodymudskipper commented Dec 12, 2019

I had implemented this idea of patching in my package safejoin, which is wrapped around dplyr's join functions.

I have a conflict argument, which is a 2 argument function will be applies on each names present in pairs of conflicting columns (same name and not included in "by").

The patching you describe here can be done by using conflict = ~dplyr::coalesce(.y, .x). If we want NA values from 2nd data frame to have precedence over non NA values from data frame 1 we use the special value conflict = "patch" (which seems to be broken in my package but that's another issue).

It makes more sense to me to integrate this in join operations than new verbs because we might want to use it with full_join (close equivalent to @hadley's insert in tidyverse/dplyr#4654) , semi_join (close to @hadley's update in tidyverse/dplyr#4654), left_join (patching as described here), or inner_join etc.

coalescing is the most common request (see all these SO questions : https://stackoverflow.com/search?q=safejoin ) but having a general conflict argument allows to add values in some cases, or to create list element, or to pack conflicting columns with conflict = ~tibble(x=.x,y=.y). conflict = ~.x means we just keep the first value, so it's safe and we don't have a column name disappearing mysteriously depending on what we have in the second data.frame (which leads to frustrating debugging).

With @billdenney 's example above :

df1 <- tibble::tibble(A = 1, B = 2, C = 3)
df2 <- tibble::tibble(A = 4, B = 5, C = 6)

# patching as defined above
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~dplyr::coalesce(.y, .x))
#> # A tibble: 2 x 3
#>       A     B     C
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

# packing
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~tibble::tibble(x=.x, y=.y))
#> # A tibble: 2 x 3
#>       A   B$x    $y   C$x    $y
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     3    NA
#> 2     4    NA     5    NA     6

# nesting
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~purrr::map2(.x,.y, list))
#> # A tibble: 2 x 3
#>       A B          C         
#>   <dbl> <list>     <list>    
#> 1     1 <list [2]> <list [2]>
#> 2     4 <list [2]> <list [2]>

# ignoring right side df conflicting columns
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~.x)
#> # A tibble: 2 x 3
#>       A     B     C
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4    NA    NA

Created on 2019-12-12 by the reprex package (v0.3.0)

I'd much rather see these features in dplyr than safejoin.

@hadley
Copy link
Member

@hadley hadley commented Apr 21, 2020

Most likely to be implemented as part of tidyverse/dplyr#4654

@hadley hadley closed this Apr 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants