Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Preferred Column Values After Merge #3267

Closed
billdenney opened this issue Dec 29, 2017 · 7 comments
Closed

Feature Request: Preferred Column Values After Merge #3267

billdenney opened this issue Dec 29, 2017 · 7 comments
Labels
feature a feature request or enhancement verbs 🏃‍♀️

Comments

@billdenney
Copy link
Contributor

I often want to join two data.frames and then select the "best" result from the output columns.

This happens when I may have two sources of information with partially overlapping information. One source may be more reliable than the other, so I would prefer to use source 1 if it has a value. If source 1 doesn't have a value, I'd prefer to use the other source.

The function below does what I'm looking for, and I think it would fit in well in dplyr. If of interest, I can generate a pull request with this and tests.

#' Following a join, select non-missing values from one source
#' removing the other and combining duplicated column names.
#'
#' @param x The previously-joined data.frame (or similar)
#' @param suffix_order The column name suffixes to combine in preference
#'   order.  Values with the first suffix that are not missing will be
#'   kept.  Values missing with the first suffix will be replaced with
#'   values from the second suffix.
#' @param missing_values Either a vector or a list.  If a vector, all
#'   columns will be compared to the vector.  If a list, then column
#'   names will be matched to list names (and if not all column names.
#' @param duplicated_columns Column names (without the suffix) that are
#'   duplicated.  If not provided, they will be auto-detected.
#' @return A data.frame (or similar; input class is not changed) with
#'   \code{duplicated_columns} combined into one based on preference
#'   order.
#' @examples
#' # Missing values are substituted
#' full_join(data.frame(A=1, B=2, C=3),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer()
#'
#' # Non-missing values (C==7) are kept
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer()
#'
#' # Not all columns must be operated on
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(duplicated_columns="B")
#'
#' # Nonstandard missing values may be used
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=7)
#'
#' # Multiple missing values may be used
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=c(NA, 7))
#'
#' # Missing values can be specific to a column
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=list(B=NA))
#'
#' # Nonstandard missing values can be specific to a column
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=list(B=5))
#' @export
join_prefer <- function(x, suffix_order=c(".x", ".y"),
                        missing_values=NA,
                        duplicated_columns=NULL) {
  trim_suffix <- function(nm, suffix) {
    mask_match <- grepl(suffix, nm, fixed=TRUE)
    if (any(mask_match)) {
      gsub(suffix, "", nm[mask_match], fixed=TRUE)
    } else {
      character(0)
    }
  }
  if (is.null(duplicated_columns)) {
    if (is.list(missing_values)) {
      duplicated_columns <- names(missing_values)
    } else {
      trimmed_columns_1 <- trim_suffix(names(x), suffix_order[1])
      trimmed_columns_2 <- trim_suffix(names(x), suffix_order[2])
      duplicated_columns <- intersect(trimmed_columns_1, trimmed_columns_2)
    }
  }
  expected_columns <- c(paste0(duplicated_columns, suffix_order[1]),
                        paste0(duplicated_columns, suffix_order[2]))
  if (any(mask_missing_columns <- !(expected_columns %in% names(x)))) {
    stop("All duplicated_columns with suffix must be names of columns of x.  The following are not found: ",
         paste(expected_columns[mask_missing_columns], collapse=", "))
  }
  if (is.list(missing_values)) {
    if (!all(mask_not_found <- names(missing_values) %in% duplicated_columns)) {
      stop("Some names of missing_values are not duplicated_columns: ",
           paste(names(missing_values)[mask_not_found], collapse=", "))
    } else if (!all(mask_not_found <- duplicated_columns %in% names(missing_values))) {
      stop("Some duplicated_columns are not in the names of missing_values: ",
           paste(duplicated_columns[mask_not_found], collapse=", "))
    }
  }
  if (length(duplicated_columns)) {
    for (nm in duplicated_columns) {
      colname_1 <- paste0(nm, suffix_order[1])
      colname_2 <- paste0(nm, suffix_order[2])
      x[[nm]] <- x[[colname_1]]
      current_missing <-
        if (is.list(missing_values)) {
          missing_values[[nm]]
        } else {
          missing_values
        }
      mask_missing <- x[[colname_1]] %in% current_missing
      x[[nm]][mask_missing] <- x[[colname_2]][mask_missing]
      # Drop the duplicated column names
      x[[colname_1]] <- x[[colname_2]] <- NULL
    }
  } else {
    message("No duplicated columns found.")
  }
  x
}
@krlmlr krlmlr added feature a feature request or enhancement verbs 🏃‍♀️ labels Dec 30, 2017
@krlmlr
Copy link
Member

krlmlr commented Dec 30, 2017

Thanks. This looks like an useful feature, I would use it for applying patch tables: partial dictionaries that contain replacement values for only a subset of the data. Maybe we can create new verbs for this operation, because it only seems useful in full joins?

update_join <- function(x, y, by) ...
combine_join <- function(x, y, by) ...

(Not sure what a good naming choice would be here.)

The operations can be implemented much faster when combining values right away during the join. Also, SQL backends could do this very efficiently. We'd have to decide what to do with incompatible data types, I'd say we should be strict and throw an error. What do you think?

@JohnMount
Copy link

Perhaps stay near SQL naming and use the term coalesce()?

@billdenney
Copy link
Contributor Author

I'm glad that you like the idea, @krlmlr!

I see it as useful in many scenarios (not just in the context of a join). An example for me is that I often work with meta-analysis databases (data are extracted from many journal articles in the scientific literature), and I have similar sets of information from different articles which I'd want to choose the best from. For example, the number of individuals with a measurement may come from a column specific to "number of individuals with a measurement" or it may come from "number of individuals in the study". (And that sounds like your partial dictionary example, with the difference that it originates in a single data set rather than from a join.)

Within the context of joins, I see value in full_join and in a variant of semi_join where I would not want to expand x (as full_join may).

My reason for giving the example is: I like the function existing separately so that I can use it at times when I may not just be joining. But, I do often need it with a join, and I can see that it would be much faster performed during the join than post-processing when part of a join.

For naming... shrug I don't have a strong preference. Of the choices you list, I like "update" rather than "combine" because that sounds more like the operation occurring to me. (Edit just before sending, I like @JohnMount 's suggestion of coalesce.)

For incompatible data types there are two questions:

  • Incompatible in missing_values, which I think should be an error if given as a list and an attempt at coercion if given as a vector. My method was that if %in% could compare, that seemed sufficient.
  • Incompatible in the potentially duplicated columns, which I think makes sense to be an error (after trying some simple coercion).

@krlmlr
Copy link
Member

krlmlr commented Dec 30, 2017

The proposed operations are some join combined with a vectorized reduction operator, by default coalesce().

I'd like to keep an initial implementation as simple as possible, no "new" columns should be allowed on the RHS, and all observations from the LHS are kept. We may want to be able to specify which side takes precedence, this seems to give two separately useful operations (LHS wins: partial dictionary update, RHS wins: upsert). Would that solve your most common use cases?

A standalone verb that operates on an already expanded table might be better suited in tidyr. Can you outline a use case?

Some of this has been covered in more detail in tidyverse/tidyr#183, and earlier in #2075. I now think performance considerations make it worthwhile to consider an implementation in dplyr, a tidyr interface might add more bells and whistles.

@billdenney
Copy link
Contributor Author

@krlmlr, I agree that a simpler version makes sense within dplyr and a more bells-and-whistles interface makes sense for tidyr. I'll add into the tidyr discussion. I'm not the best to help with a SQL-friendly implementation, unfortunately.

@hadley
Copy link
Member

hadley commented May 27, 2019

Sorry, I think this is out of scope for dplyr, and would be best in tidyr.

@hadley hadley closed this as completed May 27, 2019
@lock
Copy link

lock bot commented Nov 23, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement verbs 🏃‍♀️
Projects
None yet
Development

No branches or pull requests

4 participants