More joins #1275

bramtayl · 2015-07-17T09:08:47Z

Insert join: where the by variables match, simply take all row data from the newer data frame and replace the row data in the older data frame. This is useful for a subset-mutate-update type workflow.

Update join: a regular full join, but when there are common variable names, move all non-NA values from the column in the newer data frame to the older data frame. This is useful to a summarize-mutate-update type workflow.

I've implemented this (likely very badly) in the loopr CRAN package. I'm basically given up on the stacking system there as overcomplication, but I think that the joins are broadly applicable. See code below.

#' Amend variables with new information
#' 
#' Replace all non-NA values in one set of columns with values from another matching set
#' @importFrom magrittr %>%
#' @export
#' 
#' @param data A data frame
#' @param originalNames A vector of column names with out-of-date information
#' @param amendNames A vector of column names with amended information. They will be removed at the end of processing.
#' @return An amended \code{\link{tbl_df}}

amendColumns = function(data, originalNames, amendNames) {
  dataNames = dplyr::data_frame(originalNames,
                                amendNames) %>%
    dplyr::mutate(index = 1:length(originalNames))
  #build calls using ifelse
  calls = plyr::dlply(.data = dataNames,
                      .fun = function(row) lazyeval::lazy(
                        ifelse(is.na(amendNames), 
                               originalNames, 
                               amendNames)) %>%
                        lazyeval::interp(amendNames = as.name(row$amendNames),
                                         originalNames = as.name(row$originalNames)),
                      .variables = "index") %>%
    setNames(originalNames)

  data %>% 
    dplyr::mutate_(.dots = calls) %>%
    dplyr::select_(.dots = sprintf("-`%s`", amendNames))}

#' Amend a dataframe with new information
#' 
#' \code{\link{full_join}} two dataframes. If there are matching columns, 
#' amend each \code{data} column with the corresponding \code{amendData} column using \code{\link{amendColumns}}.
#' 
#' @importFrom magrittr %>%
#' @export
#' 
#' @param data A data frame
#' @param amendData A data frame
#' @param by A quoted vector of column names to join by. If set to NULL or unspecified, will default to the grouping columns in data
#' @param suffix A suffix used internally. No existing column names should use this suffix.
#' @return An amended \code{\link{tbl_df}}
amend = function(data, amendData, by = NULL, suffix = "toFix") {

  #default by variables from groups
  if (is.null(by)) by = 
    data %>% 
    dplyr::groups() %>% 
    lapply(deparse) %>% 
    unlist %>% 
    as.vector 

  if (is.null(by)) stop("Defaulted to merging by data grouping variables. However, no grouping variables found")

  #figure out which columns need to be merged.
  commonNames = intersect(names(data), names(amendData)) %>% 
    setdiff(by)
  if (length(commonNames) != 0) message("Amending columns: ", paste(commonNames, collapse = ", "))

  #if no columns need to be merge, a simple full join
  if (length(commonNames) == 0) dplyr::full_join(data, amendData) else {

    #else update columns then join
    toFix = paste0(commonNames, suffix = suffix)

    if (sum(toFix %in% names(amendData)) > 0) stop ("suffix conflict. Please choose another suffix.")

    names(toFix) = commonNames

    byLiteral = by %>% sprintf("`%s`", .)

    amendData %>%
      plyr::rename(toFix) %>%
      dplyr::full_join(data, by) %>% 
      amendColumns(commonNames, unname(toFix)) %>%
      dplyr::arrange_(.dots = byLiteral)}}

#' Insert new information into a dataframe.
#' 
#' \code{\link{anti_join}} data with insertData, then \code{\link{bind_cols}} of insertData, then arrange by \code{by} variables.
#' @importFrom magrittr %>%
#' @export
#' 
#' @param data A data frame
#' @param insertData A data frame
#' @param by A quoted vector of column names to join by.
#' @return An inserted \code{\link{tbl_df}}
insert = function(data, insertData, by)
  data %>%
  dplyr::anti_join(insertData, by = by) %>%
  dplyr::bind_rows(insertData) %>%
  dplyr::arrange_(.dots = by)

hadley · 2015-08-24T22:08:14Z

I think these are nice ideas, but I currently don't think that dplyr is the right place for them - I want to stick to the pure SQL joins for now.

hadley closed this as completed Aug 24, 2015

lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More joins #1275

More joins #1275

bramtayl commented Jul 17, 2015

hadley commented Aug 24, 2015

More joins #1275

More joins #1275

Comments

bramtayl commented Jul 17, 2015

hadley commented Aug 24, 2015