Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More joins #1275

Closed
bramtayl opened this issue Jul 17, 2015 · 1 comment
Closed

More joins #1275

bramtayl opened this issue Jul 17, 2015 · 1 comment

Comments

@bramtayl
Copy link

Insert join: where the by variables match, simply take all row data from the newer data frame and replace the row data in the older data frame. This is useful for a subset-mutate-update type workflow.

Update join: a regular full join, but when there are common variable names, move all non-NA values from the column in the newer data frame to the older data frame. This is useful to a summarize-mutate-update type workflow.

I've implemented this (likely very badly) in the loopr CRAN package. I'm basically given up on the stacking system there as overcomplication, but I think that the joins are broadly applicable. See code below.

#' Amend variables with new information
#' 
#' Replace all non-NA values in one set of columns with values from another matching set
#' @importFrom magrittr %>%
#' @export
#' 
#' @param data A data frame
#' @param originalNames A vector of column names with out-of-date information
#' @param amendNames A vector of column names with amended information. They will be removed at the end of processing.
#' @return An amended \code{\link{tbl_df}}

amendColumns = function(data, originalNames, amendNames) {
  dataNames = dplyr::data_frame(originalNames,
                                amendNames) %>%
    dplyr::mutate(index = 1:length(originalNames))
  #build calls using ifelse
  calls = plyr::dlply(.data = dataNames,
                      .fun = function(row) lazyeval::lazy(
                        ifelse(is.na(amendNames), 
                               originalNames, 
                               amendNames)) %>%
                        lazyeval::interp(amendNames = as.name(row$amendNames),
                                         originalNames = as.name(row$originalNames)),
                      .variables = "index") %>%
    setNames(originalNames)

  data %>% 
    dplyr::mutate_(.dots = calls) %>%
    dplyr::select_(.dots = sprintf("-`%s`", amendNames))}

#' Amend a dataframe with new information
#' 
#' \code{\link{full_join}} two dataframes. If there are matching columns, 
#' amend each \code{data} column with the corresponding \code{amendData} column using \code{\link{amendColumns}}.
#' 
#' @importFrom magrittr %>%
#' @export
#' 
#' @param data A data frame
#' @param amendData A data frame
#' @param by A quoted vector of column names to join by. If set to NULL or unspecified, will default to the grouping columns in data
#' @param suffix A suffix used internally. No existing column names should use this suffix.
#' @return An amended \code{\link{tbl_df}}
amend = function(data, amendData, by = NULL, suffix = "toFix") {

  #default by variables from groups
  if (is.null(by)) by = 
    data %>% 
    dplyr::groups() %>% 
    lapply(deparse) %>% 
    unlist %>% 
    as.vector 

  if (is.null(by)) stop("Defaulted to merging by data grouping variables. However, no grouping variables found")

  #figure out which columns need to be merged.
  commonNames = intersect(names(data), names(amendData)) %>% 
    setdiff(by)
  if (length(commonNames) != 0) message("Amending columns: ", paste(commonNames, collapse = ", "))

  #if no columns need to be merge, a simple full join
  if (length(commonNames) == 0) dplyr::full_join(data, amendData) else {

    #else update columns then join
    toFix = paste0(commonNames, suffix = suffix)

    if (sum(toFix %in% names(amendData)) > 0) stop ("suffix conflict. Please choose another suffix.")

    names(toFix) = commonNames

    byLiteral = by %>% sprintf("`%s`", .)

    amendData %>%
      plyr::rename(toFix) %>%
      dplyr::full_join(data, by) %>% 
      amendColumns(commonNames, unname(toFix)) %>%
      dplyr::arrange_(.dots = byLiteral)}}

#' Insert new information into a dataframe.
#' 
#' \code{\link{anti_join}} data with insertData, then \code{\link{bind_cols}} of insertData, then arrange by \code{by} variables.
#' @importFrom magrittr %>%
#' @export
#' 
#' @param data A data frame
#' @param insertData A data frame
#' @param by A quoted vector of column names to join by.
#' @return An inserted \code{\link{tbl_df}}
insert = function(data, insertData, by)
  data %>%
  dplyr::anti_join(insertData, by = by) %>%
  dplyr::bind_rows(insertData) %>%
  dplyr::arrange_(.dots = by)
@hadley
Copy link
Member

hadley commented Aug 24, 2015

I think these are nice ideas, but I currently don't think that dplyr is the right place for them - I want to stick to the pure SQL joins for now.

@hadley hadley closed this as completed Aug 24, 2015
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants