feature request : add `merge` indicator after a merge in dplyr #2183

randomgambit · 2016-10-18T14:57:49Z

Hello there,

Congrats for the great work here! I have a suggestion to make.

Is there a way to get the equivalent of a _merge indicator variable after a merge in Dplyr?

Something similar to Pandas' indicator = True option that essentially tells you how the merge went (how many matches from each dataset, etc).

Here is an example in Pandas

import pandas as pd

df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})

match = df1.merge(df2, how = 'left', indicator = True)

Here, after a left join between df1 and df2, you want to immediately know how many rows in df1 found a match in df2 and how many of them did not

match
Out[53]: 
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only

and I can tabulate this merge variable:

match._merge.value_counts()
Out[52]: 
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64

I don't see any option available after a, say, left join in Dplyr

key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)

> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA

Am I missing something here?

Originally posted on SO http://stackoverflow.com/questions/40110644/dplyr-is-there-a-merge-indicator-available-after-a-merge.

The solution provided by the user is nice, but I believe this should be be an option implemented directly in dplyr merge. Having control over how the match went is fundamental.

What do you think?
Thanks!

The text was updated successfully, but these errors were encountered:

adamMaier · 2016-10-18T19:34:10Z

Stata similarly creates a new variable _merge when doing any type of merge or join. It works nearly identically to the Pandas example above. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it, especially when doing basic data exploration and cleaning.

I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable _merge

For the last few months I've been using basic functions I wrote that simply embellish the dplyr joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join. If you set the option .merge = T you'll get a variable similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.)

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                            .merge = FALSE, ...){

    # Checking to make sure used variable names are not already in use
    if(".x_tracker" %in% names(x)){
        message("Warning: variable .x_tracker in left data was dropped")
    }
    if(".y_tracker" %in% names(y)){
        message("Warning: variable .y_tracker in right data was dropped")
    }
    if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
        stop("Variable .merge already exists; change name before proceeding")
    }

    # Adding simple merge tracker variables to data frames
    x[, ".x_tracker"] <- 1
    y[, ".y_tracker"] <- 1

    # Doing full join
    joined <- full_join(x, y, by = by, suffix = suffix,  ...)

    # Calculating merge diagnoses 
    matched <- joined %>%
        filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
        NROW()
    unmatched_x <- joined %>%
        filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
        NROW()
    unmatched_y <- joined %>%
        filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
        NROW()

    # Print merge diagnoses
    message(
        unmatched_x, " Rows ONLY from left data frame", "\n",
        unmatched_y, " Rows ONLY from right data frame", "\n",
        matched, " Rows matched"
    )

    # Create .merge variable if specified
    if(.merge){
        joined <- joined %>%
            mutate(.merge = 
                       case_when(
                           !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                           is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                           TRUE ~ "matched"
                           )
                   )
    }

    # Dropping tracker variables and returning data frame
    joined <- joined %>%
        select(-.x_tracker, -.y_tracker)
    return(joined)

}

randomgambit · 2016-10-25T16:16:44Z

hello everyone! just a follow up, is anyone interested in coding that up one day? I think this would be a really useful feature

thanks!!

krlmlr · 2016-11-07T14:52:35Z

I'm still wondering how to implement this for SQL backends.

randomgambit · 2016-11-09T15:49:30Z

Hi @krlmlr , I dont know about SQL backends, but providing this info after a regular merge with a regular dataframe in RStudio would be, for itself, a great first step. In my humble opinion, having control over the matching process is probably one of the most important things when analysing data.

Thanks!

krlmlr · 2016-11-09T19:45:26Z

@adamMaier: Actually, after looking more carefully at your implementation, I think it can be changed so that it also works for SQL backends. In some cases, computing number of matched or unmatched rows might benefit from a compute() call for SQL backends.

@hadley: Are tracked joins a good fit for dplyr, or perhaps tidyr? For data frames, we could certainly be more efficient than adding tracker columns.

hadley · 2016-11-09T21:55:29Z

@krlmlr it seems like a good idea to me.

adamMaier · 2016-11-23T13:42:01Z

Thanks, all. Glad to see you're considering this.

@krlmlr : Yes, you can certainly implement a more efficient approach than what I wrote with additional tracker columns. My approach was just a simple fill-in until I could figure out a smarter way to do this. (I'm coming from a longtime Stata background, so don't know much about SQL backends.) Happy to help in any way I can.

krlmlr · 2016-11-23T18:23:21Z

@adamMaier: Actually, I don't think it's much more than to use mutate() instead of [<-(), and perhaps count(joined, ...) to compute matched, unmatched_x and unmatched_y.

krlmlr · 2016-12-01T17:59:11Z

Related: #1619.

randomgambit · 2016-12-02T18:07:16Z

Hello there @krlmlr @hadley ! I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr ? Cannot wait to see how my merge are doing :D

iangow · 2017-03-29T22:34:41Z

I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr ? Cannot wait to see how my merges are doing :D

In the meantime, one approach is something like this:

library(dplyr, warn.conflicts = FALSE)

df_x <- memdb_frame(a = 1:2600L, b = rep(letters, 100))

df_y <- memdb_frame(a = 1501:2500L, c = rep(LETTERS[1:10], 100))

merged <- df_x %>%
    mutate(on_x = 1L) %>%
    left_join(
        df_y %>% 
            mutate(on_y = 1L)) %>%
    mutate(on_x = coalesce(on_x, 0L),
           on_y = coalesce(on_y, 0L))
#> Joining, by = "a"

merged %>%
    count(on_x, on_y)
#> Source:   query [?? x 3]
#> Database: sqlite 3.11.1 [:memory:]
#> Groups: on_x
#> 
#>    on_x  on_y     n
#>   <int> <int> <int>
#> 1     1     0  1600
#> 2     1     1  1000

Or, using PostgreSQL (I think this would work with tibbles too).

library(dplyr, warn.conflicts = FALSE)

pg <- src_postgres()

df_x <- 
    tibble(a = 1:2600L, b = rep(letters, 100)) 

df_x <- copy_to(pg, df_x)

df_y <- tibble(a = 2001:3000L, c = rep(LETTERS[1:10], 100))

df_y <- copy_to(pg, df_y)

merged <- df_x %>%
    mutate(on_x = TRUE) %>%
    full_join(
        df_y %>% 
            mutate(on_y = TRUE)) %>%
    mutate(on_x = coalesce(on_x, FALSE),
           on_y = coalesce(on_y, FALSE))
#> Joining, by = "a"

merged %>%
    count(on_x, on_y)
#> Source:   query [?? x 3]
#> Database: postgres 9.6.2 [igow@hostname:5432/crsp]
#> Groups: on_x
#> 
#>    on_x  on_y     n
#>   <lgl> <lgl> <dbl>
#> 1 FALSE  TRUE   400
#> 2  TRUE FALSE  2000
#> 3  TRUE  TRUE   600

randomgambit · 2017-06-21T16:54:40Z

@iangow @hadley

thanks for the suggestion. I think that works but, again, it is a pain to create fake vectors of 1s in each dataframe every time one does a merge. The fact that @hadley removed this from to_do means you are dropping this feature from future versions of dplyr?

Thanks!

hadley · 2017-06-21T17:13:26Z

No, it just means it didn't make it for this release.

daniloimparato · 2019-01-10T23:13:30Z

This would be a very useful feature indeed. Can't wait to see it implemented.

adamMaier · 2019-01-11T13:52:30Z

@daniloimparato: I ended up putting a function to print diagnostics after implementing a dplyr join (as well as functions to print helpful information after other dplyr functions, like filter, mutate, and summarize) in a package, here. There is help documentation and a vignette in the package in case you're interested: reviewr vignette.pdf

I mostly created it for my own use and for data analysts with whom I work as I find having noisier functions helpful in the early data wrangling stages. I'm using it as a stand-in until something like this becomes a feature of dplyr.

daniloimparato · 2019-01-11T17:50:32Z

@adamMaier Sounds cool. I'll give it a try :)

For the past week I've been flirting with matthieugomez' statar::join as it additionally sports some nice key integrity checking beforehand. It bugs me that such features aren't prioritized.

elbersb · 2019-05-27T15:27:42Z

Possibly of interest: https://github.com/elbersb/tidylog
And especially this issue: elbersb/tidylog#25
Feedback welcome!

hadley · 2019-12-10T23:23:49Z

Given that tidylog has solved this problem so wonderfully, I no longer think this needs to be in dplyr — thanks @elbersb !

JackLandry · 2019-12-20T02:48:36Z

While I appreciate the ability of tidylog to give summary information about how a merge went, the original request was for an option that would create a new variable that tells you if a given row was merged. I think that would still be really useful! For instance, I might like to subsequently subset the data to only the rows that merged for some operations. I was looking to see if this feature existed and found this thread.

iangow · 2019-12-20T02:50:35Z

@JackLandry I thought the same thing. It might be worth asking @elbersb if the wrapper functions in tidylog could be used to that end. The code above gives a stop-gap solution.

lock · 2020-06-24T19:24:41Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

randomgambit mentioned this issue Oct 27, 2016

[Feature] Would be nice to have a simple diagnosis after a join #2202

Closed

krlmlr mentioned this issue Dec 1, 2016

Preventing Addition of Rows in left_join #2278

Closed

hadley added database feature a feature request or enhancement labels Feb 2, 2017

hadley added verbs 🏃‍♀️ and removed database labels Feb 14, 2017

krlmlr mentioned this issue Mar 16, 2018

[Feature Request] Join Information added to dplyr joins functions #3231

Closed

hadley closed this as completed Dec 10, 2019

lock bot locked and limited conversation to collaborators Jun 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature request : add `merge` indicator after a merge in dplyr #2183

feature request : add `merge` indicator after a merge in dplyr #2183

randomgambit commented Oct 18, 2016 •

edited

Loading

adamMaier commented Oct 18, 2016

randomgambit commented Oct 25, 2016 •

edited

Loading

krlmlr commented Nov 7, 2016

randomgambit commented Nov 9, 2016 •

edited

Loading

krlmlr commented Nov 9, 2016

hadley commented Nov 9, 2016

adamMaier commented Nov 23, 2016

krlmlr commented Nov 23, 2016 •

edited

Loading

krlmlr commented Dec 1, 2016

randomgambit commented Dec 2, 2016 •

edited

Loading

iangow commented Mar 29, 2017

randomgambit commented Jun 21, 2017

hadley commented Jun 21, 2017

daniloimparato commented Jan 10, 2019

adamMaier commented Jan 11, 2019 •

edited

Loading

daniloimparato commented Jan 11, 2019 •

edited

Loading

elbersb commented May 27, 2019

hadley commented Dec 10, 2019

JackLandry commented Dec 20, 2019

iangow commented Dec 20, 2019

lock bot commented Jun 24, 2020

feature request : add merge indicator after a merge in dplyr #2183

feature request : add merge indicator after a merge in dplyr #2183

Comments

randomgambit commented Oct 18, 2016 • edited Loading

adamMaier commented Oct 18, 2016

randomgambit commented Oct 25, 2016 • edited Loading

krlmlr commented Nov 7, 2016

randomgambit commented Nov 9, 2016 • edited Loading

krlmlr commented Nov 9, 2016

hadley commented Nov 9, 2016

adamMaier commented Nov 23, 2016

krlmlr commented Nov 23, 2016 • edited Loading

krlmlr commented Dec 1, 2016

randomgambit commented Dec 2, 2016 • edited Loading

iangow commented Mar 29, 2017

randomgambit commented Jun 21, 2017

hadley commented Jun 21, 2017

daniloimparato commented Jan 10, 2019

adamMaier commented Jan 11, 2019 • edited Loading

daniloimparato commented Jan 11, 2019 • edited Loading

elbersb commented May 27, 2019

hadley commented Dec 10, 2019

JackLandry commented Dec 20, 2019

iangow commented Dec 20, 2019

lock bot commented Jun 24, 2020

feature request : add `merge` indicator after a merge in dplyr #2183

feature request : add `merge` indicator after a merge in dplyr #2183

randomgambit commented Oct 18, 2016 •

edited

Loading

randomgambit commented Oct 25, 2016 •

edited

Loading

randomgambit commented Nov 9, 2016 •

edited

Loading

krlmlr commented Nov 23, 2016 •

edited

Loading

randomgambit commented Dec 2, 2016 •

edited

Loading

adamMaier commented Jan 11, 2019 •

edited

Loading

daniloimparato commented Jan 11, 2019 •

edited

Loading