New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset patching tools #183

Open
MarcusWalz opened this Issue Apr 19, 2016 · 13 comments

Comments

Projects
None yet
6 participants
@MarcusWalz
Copy link

MarcusWalz commented Apr 19, 2016

Formerly filed in dpylr repo.

I'm doing an analysis that requires a lot of manual corrections to the original dataset. I keep tables of changes that mirror the original (see example in code) and then I use the code below to impute the original values. Probably niche, but who am I to decide. Here's my code:

library(dplyr)
library(lazyeval)

update_with = function(data, update, by) {

  # columns to be updated
  # (data intersect cols) -- by

  orig_order = colnames(data)
  update_cols =
    setdiff(intersect(colnames(data), colnames(update)), by)

  # left join update
  data = data %>% left_join(update, by=by)

  # transmutate one by one
  for(col in update_cols) {
    col.x = paste(col, "x", sep=".")
    col.y = paste(col, "y", sep=".")

    myifelse = function(x, y) {
      x[!is.na(y)] <- y[!is.na(y)]
      x
    }
    update = interp( ~myifelse(x, y)
                   , x=as.name(col.x)
                   , y=as.name(col.y)
                   )

    data = data %>%
      mutate_( .dots = setNames(list(update), col) ) %>%
      select( -one_of(c(col.x,col.y)) )
  }

  # deletes extra columns too
  data[,orig_order]
}



a = data.frame(x = 1:100,y = 5, z = 1:50)
b = data.frame(x = 1:10, y = 0)

# sets y to 0 when x  is in range 1:10, otherwise a remains the same 
a %>% update_with(b, by="x")

After using this method for awhile I keep my patches in this format, which assigns value to the cell described by Row and Column.

Row Column Value Data to help fill in value
1 x 123 hint: it's as easy as...
2 y xyz link to pathology report

and I remove the rightmost column use spread on the column Column and feed the subsequent dataframe into update_with. This probably has some unintended side effects when mixing integers and factors together in the value column. This has been surprisingly robust and I've been able to keep a single patch file for each table.

thanks,
Marcus

@hadley

This comment has been minimized.

Copy link
Member

hadley commented May 16, 2016

I like the idea, but that patch form feels a bit fragile to me - if the row order changes you'll silently patch the wrong values. I think it would be better to supply the value of some keys:

mtcars2 <- tibble::rownames_to_column(mtcars, "model")

patches <- frame_data(
  ~ model, ~ column, ~value, ~comment,
  "Mazda RX4", "mpg", 20, "Bad copy & paste"
)

mtcars %>% patch(patches)

This would require the yet unimplemented key-checking functions from dplyr to make sure that there was a unique match for each key.

@jennybc have you thought about this at all?

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented May 16, 2016

I've only thought about it up to ... agreeing that it would be really useful to have better tools for patching!

I have really awkward workflows for this and it seems nice to think of it as a novelty join, which is how I interpret your comment.

I'd been planning to delve into the new dplyr::case_when() the next time this came up. But re-skimming that issue (tidyverse/dplyr#631) and reading the help file makes me think case_when() doesn't really solve it? I just recalled that patching did come up explicitly over there.

Why would this go in tidyr vs dplyr?

@hadley

This comment has been minimized.

Copy link
Member

hadley commented May 17, 2016

Hmmm, interesting - that patch() feels more like a "mutate_when", which I'm not currently super big on. This patch() feels more like the sort of thing you'd want to create (either by hand or with an add-in) when you discover some bad values in your input.

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented May 19, 2016

What I meant by "novelty join:" Like left join, but instead of getting mpg.x, mpg.y, wt.x, and wt.y, we get just mpg and wt in the result, with the values in patches taking precedence. The proposals are pretty different in terms of how to patch more than one variable.

library(dplyr)
(mtcars2 <- tibble::rownames_to_column(mtcars, "model") %>%
  head(3) %>% select(model, mpg, cyl, hp, drat, wt))
#>           model  mpg cyl  hp drat    wt
#> 1     Mazda RX4 21.0   6 110 3.90 2.620
#> 2 Mazda RX4 Wag 21.0   6 110 3.90 2.875
#> 3    Datsun 710 22.8   4  93 3.85 2.320

patches <- frame_data(
      ~ model, ~ mpg, ~ wt,
  "Mazda RX4",   500,  200
)

## fiction!
mtcars2 %>%
  patch(patches, by = "model")
#>           model   mpg cyl  hp drat    wt
#> 1     Mazda RX4 500.0   6 110 3.90 200.0
#> 2 Mazda RX4 Wag  21.0   6 110 3.90 2.875
#> 3    Datsun 710  22.8   4  93 3.85 2.320

BTW, assuming I understand what you mean by "mutate_when", I think one of those just came up on SO.

@hadley

This comment has been minimized.

Copy link
Member

hadley commented May 19, 2016

@jennybc yes, that's what I was picturing too - except I'm not sure if you want to patch individual variables or whole rows. In your scenario I guess you'd use NA in patches to indicate that you didn't want to replace the value? I can imagining patching either cells or rows to be useful in different scenarios. It would be easy to convert between the two forms with gather/spread, so it wouldn't matter too much which was primary.

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Jun 1, 2016

Recording a link to a related Twitter discussion. I think it's another example of the type of problem solved by the patching discussed here: "join and update columns, instead of duplicating them". It's an example where it's natural to update multiple variables at once.

In #Rstats, how to merge DFs (X,Y) with columns from Y replacing (overwriting) those having same name in X?
X has 10 cols A-J. Y has 3 cols F,G,H. Want cols from Y to replace cols in X that have same name; keep other cols from X.
Standard merge will give me new columns F.y, G.y, and H.y.

@MarcusWalz

This comment has been minimized.

Copy link
Author

MarcusWalz commented Jul 20, 2016

Hmm, I'm having to patch a lot in my current project that's a combination of an ETL process and manual data entry for variables my ETL pipeline doesn't capture. It's patches on top of patches on top of patches.

Here's my latest and greatest patch function. patch_fun controls the patching behavior. It's defaulted to the function coalesce. A custom function can be thrown in for other use cases, life if, for example, you only wish to patch NA's. Columns to patch can be specified in ..., otherwise the patch patches all columns common to data and patch_data minus the columns listed in by. by can not be null. The correctness of the patch is checked in a few places. A patch must be in injection onto the data that's being patched (i.e. a one-to-one relationship). I typically have to group and summaries to get my real world patches into a one-to-one relationship.

@jennybc example above works as specified.

require(dplyr)
require(lazyeval)

patch <- function(data, patch_data, ..., by = NULL, na_only=FALSE, patch_fun=coalesce) {
  patch_cols <- unname(dplyr::select_vars(colnames(data), ...))
  if(length(patch_cols) == 0) patch_cols <- NULL
  patch_(data, patch_data, patch_cols, by, na_only, patch_fun)
}

patch_ <- function(data, patch_data, patch_cols = NULL, by = NULL, na_only=FALSE, patch_fun=coalesce) {
  if(is.null(by)) {
    error("`by` must be specified")
  }

  # Find common cols
  common_cols  <- intersect(colnames(data), colnames(patch_data))

  if(is.null(patch_cols)) {
    patch_cols <- common_cols %>% setdiff(by)
    # TODO fire off a warning 
  }

  # No missing columns 
  missing_cols <- setdiff(c(patch_cols, by), common_cols) 
  if( length(missing_cols) > 0 ) {
    stop("Can not apply patch, columns ", paste(missing_cols, sep=", "),
         "must be in both the original data and the patch")
  }

  # No columns being patched, warn and return data as-is
  if( length(patch_cols) == 0 ) {
      warning("No rows in y to patch onto x")
      return(data)
  }

  # Can not join by a patching column
  if( length(intersect(by, patch_cols)) != 0 ) {
    stop("Cannot patch a joining column")
  }

  # Builds each term of transmute expressions for colname of x
  build_expr <- function(colname) {
    if(colname %in% patch_cols) {
      # coalesce the two columns together x = coalesce(y, x)
      interp(~patch_fun(colname_y, colname), 
             colname   = as.name(paste(colname, "x", sep=".")),
             colname_y = as.name(paste(colname, "y",sep="."))
      )
    } else {
      # identity x = x 
      interp( ~colname, colname=as.name(colname) ) 
    }
  }

  expr <- Map(build_expr, colnames(data)) 

  # number of rows produced by join should be unchanged,
  # keep only needed columns and use only distinct rows

  joined <- left_join(
    data,
    patch_data %>% select(one_of(union(patch_cols, by))) %>% distinct,
    by=by
  )

  if( nrow(data) != nrow(joined)) {
    stop("patch cannot be many-to-one with respect to data")
  }

  joined %>% transmute_(.dots=expr)
}
@hadley

This comment has been minimized.

Copy link
Member

hadley commented Feb 21, 2017

Just a note to say that upsert() feels like the right name for one of these behaviours.

@alistaire47

This comment has been minimized.

Copy link

alistaire47 commented Apr 20, 2017

I've found myself using this sort of update_join when aggregating data for a package (coming to GitHub once a few bugs are squashed). It's largely just a left or full join and then coalesceing the duplicated columns, but a related issue arises:

I have redundant key columns, but they're incomplete. Trying to update key columns with NAs before attempting to join everything requires dropping cases with NAs from one of the tables beforehand to avoid a cross join of NAs. That's fine, but suggests the possibility of updating keys: if one is NA, use the other to join and update, effectively combining keys by | instead of &.

This might fit in with join_by, maybe as a helper function working like nesting does within expand or complete, e.g. what would currently take something like

library(tidyverse)
set.seed(47)

df1 <- data_frame(key1a = c(1, 1, NA, NA, 3, 3),
                  key1b = c('a', 'a', 'b', 'b', NA, NA),
                  key2 = c(1:2, 1:2, 1:2),
                  var1 = rnorm(6))

df2 <- data_frame(key1a = c(1, 1, 2, 2, 3, 3),
                  key1b = c(NA, NA, 'b', 'b', 'c', 'c'),
                  key2 = c(1:2, 1:2, 1:2),
                  var2 = runif(6))

df1 %>% drop_na(key1a) %>% 
    full_join(df2, by = c('key1a', 'key2')) %>% 
    mutate(key1b = coalesce(key1b.x, key1b.y)) %>% 
    select(-var1, -contains('.')) %>% 
    left_join(df1, by = c('key1a', 'key2')) %>% 
    mutate(key1b = key1b.x) %>% 
    left_join(df1, by = c('key1b', 'key2')) %>% 
    mutate(key1a = key1a.x, 
           var1 = coalesce(var1.x, var1.y)) %>% 
    select(!!!rlang::syms(union(names(df1), names(df2))))

#> # A tibble: 6 × 5
#>   key1a key1b  key2       var1       var2
#>   <dbl> <chr> <int>      <dbl>      <dbl>
#> 1     1     a     1  1.9946963 0.16219364
#> 2     1     a     2  0.7111425 0.59930702
#> 3     3     c     1  0.1087755 0.40050280
#> 4     3     c     2 -1.0857375 0.03094497
#> 5     2     b     1  0.1854053 0.50603611
#> 6     2     b     2 -0.2817650 0.90197352

or could be written with update joins as something like

df1 <- df1 %>% 
    update_join(df2, by = c('key1b', 'key2')) %>% 
    update_join(df2, by = c('key1a', 'key2'))
df2 <- df2 %>% 
    update_join(df1, by = c('key1b', 'key2')) %>% 
    update_join(df1, by = c('key1a', 'key2'))

left_join(df1, df2, by = c('key1a', 'key1b', 'key2'))

could just be

left_join(df1, df2 by = join_by(updating(key1a, key1b), key2))

In essence then, it's just a series of crossing update joins on key columns before the final join, and thus shouldn't take much more code.

@hadley

This comment has been minimized.

Copy link
Member

hadley commented Nov 16, 2017

Summarising the discussion, I think there are at least patching cases described here:

  • Patch individual values described by a location (key variables + name of variable to fix), the new value, and a comment.

  • Combine two data frames where y contains some of the same variables as x. Match on key variables, then either alway takes the values from y, or only take them if x is NA.

  • Combine two data frames with identical variables. Replace matching rows in x with y, and also add new rows. This feels most like upsert().

I wonder if these could be called patch_val(), patch_col() and patch_row().

I'm not sure if only replacing missing values is a core feature or an incidental detail.

@hadley hadley changed the title Dataset Patching tools. Dataset patching tools Nov 16, 2017

@billdenney

This comment has been minimized.

Copy link
Contributor

billdenney commented Dec 30, 2017

I have a use case that is your second patching case ("Combine two data frames where y contains some of the same variables as x. Match on key variables, then either always takes the values from y, or only take them if x is NA.") and is a variant on that. (From the conversation initiated in the dplyr issue just linked.)

What you described as patch_col() further generalizes as, "If values are missing in column_1, replace them with values in column_2." And more so, allow an arbitrary number of columns and column groups. My use case for this generalization is that I do meta-analyses, and when I do them, I often have various columns representing the potential number of measurements in an observation. usually I work with clinical studies, so the N is number of subjects in a clinical trial, the various N I may have available include from most to least preferred:

  • number of subjects contributing to an observed mean
  • number of subjects in the current treatment arm of the study
  • number of subjects in the study divided by number of treatment arms

What do you think of the below as an implementation:

#' Patch missing values in a set of columns to fill in the first column.
#'
#' @param data A data frame.
#' @param ... Column names (as used by
#'   \code{\link[tidyselect]{vars_select}}).  These cannot be paired
#'   with the \code{suffix} argument.
#' @param remove If \code{TRUE}, remove all columns but the first from
#'   the output data frame.
#' @param na Values which should be replaced.
#' @param suffix A character vector of column name suffixes to combine.
#'   (Useful if a \code{merge} or \code{join} generated the data frame
#'   and multiple pairs of columns share the suffix).
#' @return The data frame with values merged into the first requested
#'   column.
#' @importFrom tidyselect vars_select
#' @export
patch_col <- function(data, ..., remove=TRUE, na=NA, suffix=c()) {
  vars <- tidyselect::vars_select(names(data), ..., .strict=TRUE)
  if (length(vars) > 0 & length(suffix) > 0) {
    stop("Cannot use ... and suffix at the same time.")
  }
  if (length(suffix) > 0) {
    patch_col_suffix(data=data, remove=remove, na=na, suffix=suffix)
  } else {
    patch_col_set(data=data, remove=remove, na=na, vars=vars)
  }
}

patch_col_set <- function(data, vars=c(), remove=TRUE, na=NA, newname=vars[1]) {
  if (length(vars) < 2) {
    stop("At least two columns must be provided to merge")
  }
  # Apply appropriate coercion tests here; for now, errors occur on
  # attempted patching if not possible.
  missing_val <- data[[vars[[1]]]] %in% na
  data[[newname]] <- data[[vars[[1]]]]
  idx <- 2
  while (any(missing_val) & idx <= length(vars)) {
    data[[newname]][missing_val] <- data[[vars[[idx]]]][missing_val]
    idx <- idx + 1
    missing_val <- data[[newname]] %in% na
  }
  if (remove) {
    data[,setdiff(names(data), setdiff(vars, newname)), drop=FALSE]
  } else {
    data
  }
}

#' @importFrom purrr reduce
patch_col_suffix <- function(data, remove=TRUE, na=NA, suffix=c()) {
  trim_suffix <- function(current_suffix, cols) {
    mask_match <- endsWith(cols, current_suffix)
    if (any(mask_match)) {
      substring(cols[mask_match], 1, nchar(cols[mask_match]) - nchar(current_suffix))
    } else {
      character(0)
    }
  }
  if (length(suffix) < 2) {
    stop("Must have at least two suffixes to combine.")
  }
  trimmed_columns <-
    lapply(suffix,
           trim_suffix,
           cols=names(data))
  duplicated_columns <- purrr::reduce(.x=trimmed_columns, .f=intersect)
  if (length(duplicated_columns)) {
    for (i in seq_along(duplicated_columns)) {
      data <- patch_col_set(data=data,
                            vars=paste0(duplicated_columns[i],
                                        suffix),
                            remove=remove,
                            na=na,
                            newname=duplicated_columns[i])
    }
    data
  } else {
    stop("No duplicated columns with the provided suffixes")
  }
}

library(dplyr)
# Without patching
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6), 
  by = "A")
#>   A B.x C.x B.y C.y
#> 1 1   2   3  NA  NA
#> 2 4  NA  NA   5   6

# With patching by name (values go into 'B.x')
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6), 
  by = "A") %>% patch_col(B.x, B.y)
#>   A B.x C.x C.y
#> 1 1   2   3  NA
#> 2 4   5  NA   6

# With patching by suffix (values go into 'B' and 'C')
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6), 
  by = "A") %>% patch_col(suffix = c(".x", ".y"))
#>   A B C
#> 1 1 2 3
#> 2 4 5 6
@billdenney

This comment has been minimized.

Copy link
Contributor

billdenney commented Dec 30, 2017

While I was at it, how about this for patch_val. An extension to this would have the arguments either be a named list (which would look for equality; current version) or a formula (which would be evaluated and coerced to a logical in the environment of the data.frame):

#' Update one or more values in a data frame
#'
#' @param data A data frame
#' @param ... Named arguments to match.  The value of the argument is
#'   compared against all values in \code{data[[nm]]} with \code{%in%},
#'   so argument values may be a scalar or a vector.
#' @param .new_val A named list of new values to put in the row(s) found.
#' @return \code{data} with updated values.
#' @export
patch_val <- function(data, ..., .new_val) {
  args <- list(...)
  if (length(args) == 0 & nrow(data) != 1) {
    stop("Must give at least one row to match unless there is only one row of data.")
  } else if (length(args) && (is.null(names(args)) | any(names(args) %in% ""))) {
    stop("All arguments must be named.")
  } else if (is.null(names(.new_val)) || any(names(.new_val %in% ""))) {
    stop(".new_val must be a named list.")
  }
  mask_match <- rep(TRUE, nrow(data))
  for (nm in names(args)) {
    mask_match <- mask_match & data[[nm]] %in% args[[nm]]
  }
  if (any(mask_match)) {
    for (nm in names(.new_val)) {
      data[[nm]][mask_match] <- .new_val[[nm]]
    }
  }
  data
}
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data.frame(A = 1:5, B = 6:10, C = 11:15, D = c(LETTERS[1:4], NA), stringsAsFactors = FALSE)
#>   A  B  C    D
#> 1 1  6 11    A
#> 2 2  7 12    B
#> 3 3  8 13    C
#> 4 4  9 14    D
#> 5 5 10 15 <NA>

data.frame(A = 1:5, B = 6:10, C = 11:15, D = c(LETTERS[1:4], NA), stringsAsFactors = FALSE) %>% 
  patch_val(A = 1, .new_val = list(D = "Q"))
#>   A  B  C    D
#> 1 1  6 11    Q
#> 2 2  7 12    B
#> 3 3  8 13    C
#> 4 4  9 14    D
#> 5 5 10 15 <NA>
@ctmann

This comment has been minimized.

Copy link

ctmann commented Jul 28, 2018

I would settle for adding a replacement option to left_join (replace.with.y=TRUE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment