Dataset patching tools #183
Comments
I like the idea, but that patch form feels a bit fragile to me - if the row order changes you'll silently patch the wrong values. I think it would be better to supply the value of some keys:
mtcars2 <- tibble::rownames_to_column(mtcars, "model")
patches <- frame_data(
  ~ model, ~ column, ~value, ~comment,
  "Mazda RX4", "mpg", 20, "Bad copy & paste"
)
mtcars2 %>% patch(patches)
This would require the yet unimplemented key-checking functions from dplyr to make sure that there was a unique match for each key. @jennybc have you thought about this at all? |
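The keyed, long-form patch described above could be sketched in base R roughly as follows. This is only an illustration, not an existing tidyr or dplyr function: the name `patch_long()` and its `key` argument are assumptions made up for this example, and the uniqueness check stands in for the key-checking functions mentioned in the comment.

```r
# Hypothetical helper (not part of tidyr/dplyr): apply long-form patches,
# one (key, column, value) triple per row of `patches`.
patch_long <- function(data, patches, key) {
  for (i in seq_len(nrow(patches))) {
    hit <- which(data[[key]] == patches[[key]][i])
    # Refuse to patch unless the key identifies exactly one row.
    if (length(hit) != 1) {
      stop("Key '", patches[[key]][i], "' matches ", length(hit),
           " rows; expected exactly 1.")
    }
    col <- patches$column[i]
    if (!col %in% names(data)) {
      stop("Unknown column: ", col)
    }
    data[[col]][hit] <- patches$value[i]
  }
  data
}

mtcars2 <- data.frame(model = rownames(mtcars), mtcars,
                      row.names = NULL, stringsAsFactors = FALSE)
patches <- data.frame(model = "Mazda RX4", column = "mpg", value = 20,
                      comment = "Bad copy & paste", stringsAsFactors = FALSE)

patched <- patch_long(mtcars2, patches, key = "model")
patched$mpg[patched$model == "Mazda RX4"]
```

Because the patch is matched on `model` rather than on row position, reordering `mtcars2` does not change which cell is patched. One limitation of the long form is that a single `value` column has to hold one type, so patches for columns of different types would need separate tables or coercion.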
I've only thought about it up to ... agreeing that it would be really useful to have better tools for patching! I have really awkward workflows for this and it seems nice to think of it as a novelty join, which is how I interpret your comment. I'd been planning to delve into the new Why would this go in |
Hmmm, interesting - that |
What I meant by "novelty join:" Like left join, but instead of getting
library(dplyr)
(mtcars2 <- tibble::rownames_to_column(mtcars, "model") %>%
  head(3) %>% select(model, mpg, cyl, hp, drat, wt))
#> model mpg cyl hp drat wt
#> 1 Mazda RX4 21.0 6 110 3.90 2.620
#> 2 Mazda RX4 Wag 21.0 6 110 3.90 2.875
#> 3 Datsun 710 22.8 4 93 3.85 2.320
patches <- frame_data(
~ model, ~ mpg, ~ wt,
"Mazda RX4", 500, 200
)
## fiction!
mtcars2 %>%
patch(patches, by = "model")
#> model mpg cyl hp drat wt
#> 1 Mazda RX4 500.0 6 110 3.90 200.0
#> 2 Mazda RX4 Wag 21.0 6 110 3.90 2.875
#> 3 Datsun 710 22.8 4 93 3.85 2.320
BTW, assuming I understand what you mean by "mutate_when", I think one of those just came up on SO. |
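To make the fictional `patch(patches, by = "model")` call concrete, here is one possible base R sketch of the wide-form "novelty join". The function body is an assumption about the intended semantics (every non-key column in `patches` overwrites the matching row of `data`), not an implementation taken from this thread.

```r
# Hypothetical sketch of the fictional patch() verb: a wide-form patch
# where each non-key column of `patches` overwrites the matching cell.
patch <- function(data, patches, by) {
  stopifnot(all(names(patches) %in% names(data)))
  idx <- match(patches[[by]], data[[by]])
  if (anyNA(idx)) {
    stop("Some patch keys have no match in `data`.")
  }
  # Each patch key must identify exactly one row of `data`.
  if (anyDuplicated(data[[by]][data[[by]] %in% patches[[by]]]) > 0) {
    stop("Some patch keys match more than one row in `data`.")
  }
  for (col in setdiff(names(patches), by)) {
    data[[col]][idx] <- patches[[col]]
  }
  data
}

mtcars2 <- data.frame(model = rownames(mtcars), mtcars,
                      row.names = NULL, stringsAsFactors = FALSE)
mtcars2 <- head(mtcars2[c("model", "mpg", "cyl", "hp", "drat", "wt")], 3)
patches <- data.frame(model = "Mazda RX4", mpg = 500, wt = 200,
                      stringsAsFactors = FALSE)

patched <- patch(mtcars2, patches, by = "model")
patched
```

This version always overwrites whole cells for every column present in `patches`; whether an `NA` in a patch should mean "leave the original value alone" is a separate design decision that the sketch does not attempt to settle.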
@jennybc yes, that's what I was picturing too - except I'm not sure if you want to patch individual variables or whole rows. In your scenario I guess you'd use NA in |
Recording a link to a related Twitter discussion. I think it's another example of the type of problem solved by the patching discussed here: "join and update columns, instead of duplicating them". It's an example where it's natural to update multiple variables at once.
|
Hmm, I'm having to patch a lot in my current project that's a combination of an ETL process and manual data entry for variables my ETL pipeline doesn't capture. It's patches on top of patches on top of patches. Here's my latest and greatest patch function. @jennybc example above works as specified.
|
Just a note to say that |
I've found myself using this sort of I have redundant key columns, but they're incomplete. Trying to update key columns with This might fit in with
library(tidyverse)
set.seed(47)
df1 <- data_frame(key1a = c(1, 1, NA, NA, 3, 3),
key1b = c('a', 'a', 'b', 'b', NA, NA),
key2 = c(1:2, 1:2, 1:2),
var1 = rnorm(6))
df2 <- data_frame(key1a = c(1, 1, 2, 2, 3, 3),
key1b = c(NA, NA, 'b', 'b', 'c', 'c'),
key2 = c(1:2, 1:2, 1:2),
var2 = runif(6))
df1 %>% drop_na(key1a) %>%
full_join(df2, by = c('key1a', 'key2')) %>%
mutate(key1b = coalesce(key1b.x, key1b.y)) %>%
select(-var1, -contains('.')) %>%
left_join(df1, by = c('key1a', 'key2')) %>%
mutate(key1b = key1b.x) %>%
left_join(df1, by = c('key1b', 'key2')) %>%
mutate(key1a = key1a.x,
var1 = coalesce(var1.x, var1.y)) %>%
select(!!!rlang::syms(union(names(df1), names(df2))))
#> # A tibble: 6 × 5
#> key1a key1b key2 var1 var2
#> <dbl> <chr> <int> <dbl> <dbl>
#> 1 1 a 1 1.9946963 0.16219364
#> 2 1 a 2 0.7111425 0.59930702
#> 3 3 c 1 0.1087755 0.40050280
#> 4 3 c 2 -1.0857375 0.03094497
#> 5 2 b 1 0.1854053 0.50603611
#> 6 2 b 2 -0.2817650 0.90197352
or could be written with update joins as something like
df1 <- df1 %>%
  update_join(df2, by = c('key1b', 'key2')) %>%
  update_join(df2, by = c('key1a', 'key2'))
df2 <- df2 %>%
  update_join(df1, by = c('key1b', 'key2')) %>%
  update_join(df1, by = c('key1a', 'key2'))
left_join(df1, df2, by = c('key1a', 'key1b', 'key2'))
could just be
left_join(df1, df2, by = join_by(updating(key1a, key1b), key2))
In essence then, it's just a series of crossing update joins on key columns before the final join, and thus shouldn't take much more code. |
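A rough idea of what such an `update_join()` might do, sketched in base R. Neither `update_join()` nor `updating()` exists in dplyr as far as this thread shows; the behaviour below (fill `NA`s in shared non-key columns from the matching row, never match on missing keys) is an assumption about the intent.

```r
# Hypothetical update_join(): for columns shared by x and y (other than
# the `by` keys), fill NA values in x from the matching row of y.
update_join <- function(x, y, by) {
  # One composite key per row; rows with any missing key get an NA key
  # so that a missing key never matches another missing key.
  make_key <- function(df) {
    key <- do.call(paste, c(unname(as.list(df[by])), sep = "\r"))
    key[!stats::complete.cases(df[by])] <- NA_character_
    key
  }
  x_key <- make_key(x)
  y_key <- make_key(y)
  if (anyDuplicated(y_key[!is.na(y_key)]) > 0) {
    stop("Keys in `y` must be unique.")
  }
  idx <- match(x_key, y_key)
  idx[is.na(x_key)] <- NA_integer_
  for (col in setdiff(intersect(names(x), names(y)), by)) {
    fill <- is.na(x[[col]]) & !is.na(idx)
    x[[col]][fill] <- y[[col]][idx[fill]]
  }
  x
}

df1 <- data.frame(key1a = c(1, 1, NA, NA, 3, 3),
                  key1b = c("a", "a", "b", "b", NA, NA),
                  key2 = rep(1:2, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(key1a = c(1, 1, 2, 2, 3, 3),
                  key1b = c(NA, NA, "b", "b", "c", "c"),
                  key2 = rep(1:2, 3),
                  stringsAsFactors = FALSE)

# Cross-fill the incomplete keys of df1 from df2, one key column at a time.
df1_filled <- update_join(df1, df2, by = c("key1b", "key2"))
df1_filled <- update_join(df1_filled, df2, by = c("key1a", "key2"))
df1_filled
```

After both passes `df1_filled` has complete `key1a` and `key1b` columns, so an ordinary join on all three keys becomes possible.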
Summarising the discussion, I think there are at least patching cases described here:
I wonder if these could be called I'm not sure if only replacing missing values is a core feature or an incidental detail. |
I have a use case that is your second patching case ("Combine two data frames where What you described as
What do you think of the below as an implementation:
#' Patch missing values in a set of columns to fill in the first column.
#'
#' @param data A data frame.
#' @param ... Column names (as used by
#' \code{\link[tidyselect]{vars_select}}). These cannot be paired
#' with the \code{suffix} argument.
#' @param remove If \code{TRUE}, remove all columns but the first from
#' the output data frame.
#' @param na Values which should be replaced.
#' @param suffix A character vector of column name suffixes to combine.
#' (Useful if a \code{merge} or \code{join} generated the data frame
#' and multiple pairs of columns share the suffix).
#' @return The data frame with values merged into the first requested
#' column.
#' @importFrom tidyselect vars_select
#' @export
patch_col <- function(data, ..., remove=TRUE, na=NA, suffix=c()) {
  vars <- tidyselect::vars_select(names(data), ..., .strict=TRUE)
  # Use scalar && for control flow rather than vectorised &
  if (length(vars) > 0 && length(suffix) > 0) {
    stop("Cannot use ... and suffix at the same time.")
  }
  if (length(suffix) > 0) {
    patch_col_suffix(data=data, remove=remove, na=na, suffix=suffix)
  } else {
    patch_col_set(data=data, remove=remove, na=na, vars=vars)
  }
}
patch_col_set <- function(data, vars=c(), remove=TRUE, na=NA, newname=vars[1]) {
  if (length(vars) < 2) {
    stop("At least two columns must be provided to merge")
  }
  # Apply appropriate coercion tests here; for now, errors occur on
  # attempted patching if not possible.
  missing_val <- data[[vars[[1]]]] %in% na
  data[[newname]] <- data[[vars[[1]]]]
  idx <- 2
  # Walk through the remaining columns until nothing is left to fill
  while (any(missing_val) && idx <= length(vars)) {
    data[[newname]][missing_val] <- data[[vars[[idx]]]][missing_val]
    idx <- idx + 1
    missing_val <- data[[newname]] %in% na
  }
  if (remove) {
    data[,setdiff(names(data), setdiff(vars, newname)), drop=FALSE]
  } else {
    data
  }
}
#' @importFrom purrr reduce
patch_col_suffix <- function(data, remove=TRUE, na=NA, suffix=c()) {
  # Return the base names of all columns ending in current_suffix
  trim_suffix <- function(current_suffix, cols) {
    mask_match <- endsWith(cols, current_suffix)
    if (any(mask_match)) {
      substring(cols[mask_match], 1, nchar(cols[mask_match]) - nchar(current_suffix))
    } else {
      character(0)
    }
  }
  if (length(suffix) < 2) {
    stop("Must have at least two suffixes to combine.")
  }
  trimmed_columns <-
    lapply(suffix,
           trim_suffix,
           cols=names(data))
  # Base names that occur with every suffix
  duplicated_columns <- purrr::reduce(.x=trimmed_columns, .f=intersect)
  if (length(duplicated_columns)) {
    for (i in seq_along(duplicated_columns)) {
      data <- patch_col_set(data=data,
                            vars=paste0(duplicated_columns[i],
                                        suffix),
                            remove=remove,
                            na=na,
                            newname=duplicated_columns[i])
    }
    data
  } else {
    stop("No duplicated columns with the provided suffixes")
  }
}
library(dplyr)
# Without patching
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6),
by = "A")
#> A B.x C.x B.y C.y
#> 1 1 2 3 NA NA
#> 2 4 NA NA 5 6
# With patching by name (values go into 'B.x')
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6),
by = "A") %>% patch_col(B.x, B.y)
#> A B.x C.x C.y
#> 1 1 2 3 NA
#> 2 4 5 NA 6
# With patching by suffix (values go into 'B' and 'C')
full_join(data.frame(A = 1, B = 2, C = 3), data.frame(A = 4, B = 5, C = 6),
by = "A") %>% patch_col(suffix = c(".x", ".y"))
#> A B C
#> 1 1 2 3
#> 2 4 5 6 |
While I was at it, how about this for
#' Update one or more values in a data frame
#'
#' @param data A data frame
#' @param ... Named arguments to match. The value of the argument is
#' compared against all values in \code{data[[nm]]} with \code{%in%},
#' so argument values may be a scalar or a vector.
#' @param .new_val A named list of new values to put in the row(s) found.
#' @return \code{data} with updated values.
#' @export
patch_val <- function(data, ..., .new_val) {
  args <- list(...)
  if (length(args) == 0 && nrow(data) != 1) {
    stop("Must give at least one row to match unless there is only one row of data.")
  } else if (length(args) && (is.null(names(args)) || any(names(args) %in% ""))) {
    stop("All arguments must be named.")
  } else if (is.null(names(.new_val)) || any(names(.new_val) %in% "")) {
    stop(".new_val must be a named list.")
  }
  # Start with every row selected, then narrow by each named condition
  mask_match <- rep(TRUE, nrow(data))
  for (nm in names(args)) {
    mask_match <- mask_match & data[[nm]] %in% args[[nm]]
  }
  if (any(mask_match)) {
    for (nm in names(.new_val)) {
      data[[nm]][mask_match] <- .new_val[[nm]]
    }
  }
  data
}
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data.frame(A = 1:5, B = 6:10, C = 11:15, D = c(LETTERS[1:4], NA), stringsAsFactors = FALSE)
#> A B C D
#> 1 1 6 11 A
#> 2 2 7 12 B
#> 3 3 8 13 C
#> 4 4 9 14 D
#> 5 5 10 15 <NA>
data.frame(A = 1:5, B = 6:10, C = 11:15, D = c(LETTERS[1:4], NA), stringsAsFactors = FALSE) %>%
patch_val(A = 1, .new_val = list(D = "Q"))
#> A B C D
#> 1 1 6 11 Q
#> 2 2 7 12 B
#> 3 3 8 13 C
#> 4 4 9 14 D
#> 5 5 10 15 <NA> |
I would settle for adding a replacement option to left_join (replace.with.y=TRUE) |
There is an implementation in tidyverse/dplyr#4595 (comment) |
I had implemented this idea of patching in my package safejoin, which is wrapped around dplyr's join functions. I have a The patching you describe here can be done by using It makes more sense to me to integrate this in join operations than new verbs because we might want to use it with full_join (close equivalent to @hadley's coalescing is the most common request (see all these SO questions: https://stackoverflow.com/search?q=safejoin ) but having a general With @billdenney's example above:
df1 <- tibble::tibble(A = 1, B = 2, C = 3)
df2 <- tibble::tibble(A = 4, B = 5, C = 6)
# patching as defined above
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~dplyr::coalesce(.y, .x))
#> # A tibble: 2 x 3
#> A B C
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
#> 2 4 5 6
# packing
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~tibble::tibble(x=.x, y=.y))
#> # A tibble: 2 x 3
#> A B$x $y C$x $y
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 NA 3 NA
#> 2 4 NA 5 NA 6
# nesting
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~purrr::map2(.x,.y, list))
#> # A tibble: 2 x 3
#> A B C
#> <dbl> <list> <list>
#> 1 1 <list [2]> <list [2]>
#> 2 4 <list [2]> <list [2]>
# ignoring right side df conflicting columns
safejoin::safe_full_join(df1, df2, by = "A", conflict = ~.x)
#> # A tibble: 2 x 3
#> A B C
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
#> 2 4 NA NA
Created on 2019-12-12 by the reprex package (v0.3.0)
I'd much rather see these features in dplyr than safejoin. |
Most likely to be implemented as part of tidyverse/dplyr#4654 |
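For readers arriving later: dplyr 1.0.0 and newer include `rows_update()` and `rows_patch()`, which appear to cover two of the cases discussed in this thread (overwrite matching rows by key, and fill only missing values by key). The small data set below is invented for illustration and assumes a recent dplyr is installed.

```r
library(dplyr)

# Invented example data: one value to correct, one value to fill in.
df <- tibble(model = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"),
             mpg = c(21, NA, 22.8))
fix <- tibble(model = c("Mazda RX4", "Mazda RX4 Wag"),
              mpg = c(500, 21))

# rows_update() overwrites every matching cell with the value from `fix`.
updated <- rows_update(df, fix, by = "model")

# rows_patch() only replaces cells that are NA in `df`.
patched <- rows_patch(df, fix, by = "model")

updated$mpg
patched$mpg
```

Both functions require the key to be unique in the second table, and by default they raise an error if a key in it has no match in the first, which addresses the fragility concern raised at the top of the thread.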
Formerly filed in dplyr repo.
I'm doing an analysis that requires a lot of manual corrections to the original dataset. I keep tables of changes that mirror the original (see example in code) and then I use the code below to impute the original values. Probably niche, but who am I to decide. Here's my code:
After using this method for a while, I keep my patches in this format, which assigns `value` to the cell described by `Row` and `Column`.
I then remove the rightmost column, use `spread` on the column `Column`, and feed the resulting data frame into `update_with`. This probably has some unintended side effects when mixing integers and factors together in the `value` column. This has been surprisingly robust and I've been able to keep a single patch file for each table.
thanks,
Marcus