Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request : add merge indicator after a merge in dplyr #2183

Closed
randomgambit opened this issue Oct 18, 2016 · 21 comments
Closed

feature request : add merge indicator after a merge in dplyr #2183

randomgambit opened this issue Oct 18, 2016 · 21 comments
Labels
feature a feature request or enhancement verbs 🏃‍♀️

Comments

@randomgambit
Copy link

randomgambit commented Oct 18, 2016

Hello there,

Congrats for the great work here! I have a suggestion to make.

Is there a way to get the equivalent of a _merge indicator variable after a merge in Dplyr?

Something similar to Pandas' indicator = True option that essentially tells you how the merge went (how many matches from each dataset, etc).

Here is an example in Pandas

import pandas as pd

df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})

match = df1.merge(df2, how = 'left', indicator = True)

Here, after a left join between df1 and df2, you want to immediately know how many rows in df1 found a match in df2 and how many of them did not

match
Out[53]: 
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only

and I can tabulate this merge variable:

match._merge.value_counts()
Out[52]: 
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64

I don't see any option available after a, say, left join in Dplyr

key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)

> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA

Am I missing something here?

Originally posted on SO http://stackoverflow.com/questions/40110644/dplyr-is-there-a-merge-indicator-available-after-a-merge.

The solution provided by the user is nice, but I believe this should be be an option implemented directly in dplyr merge. Having control over how the match went is fundamental.

What do you think?
Thanks!

@adamMaier
Copy link

Stata similarly creates a new variable _merge when doing any type of merge or join. It works nearly identically to the Pandas example above. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it, especially when doing basic data exploration and cleaning.

I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable _merge

For the last few months I've been using basic functions I wrote that simply embellish the dplyr joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join. If you set the option .merge = T you'll get a variable similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.)

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                            .merge = FALSE, ...){

    # Checking to make sure used variable names are not already in use
    if(".x_tracker" %in% names(x)){
        message("Warning: variable .x_tracker in left data was dropped")
    }
    if(".y_tracker" %in% names(y)){
        message("Warning: variable .y_tracker in right data was dropped")
    }
    if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
        stop("Variable .merge already exists; change name before proceeding")
    }

    # Adding simple merge tracker variables to data frames
    x[, ".x_tracker"] <- 1
    y[, ".y_tracker"] <- 1

    # Doing full join
    joined <- full_join(x, y, by = by, suffix = suffix,  ...)

    # Calculating merge diagnoses 
    matched <- joined %>%
        filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
        NROW()
    unmatched_x <- joined %>%
        filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
        NROW()
    unmatched_y <- joined %>%
        filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
        NROW()

    # Print merge diagnoses
    message(
        unmatched_x, " Rows ONLY from left data frame", "\n",
        unmatched_y, " Rows ONLY from right data frame", "\n",
        matched, " Rows matched"
    )

    # Create .merge variable if specified
    if(.merge){
        joined <- joined %>%
            mutate(.merge = 
                       case_when(
                           !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                           is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                           TRUE ~ "matched"
                           )
                   )
    }

    # Dropping tracker variables and returning data frame
    joined <- joined %>%
        select(-.x_tracker, -.y_tracker)
    return(joined)

}

@randomgambit
Copy link
Author

randomgambit commented Oct 25, 2016

hello everyone! just a follow up, is anyone interested in coding that up one day? I think this would be a really useful feature

thanks!!

@krlmlr
Copy link
Member

krlmlr commented Nov 7, 2016

I'm still wondering how to implement this for SQL backends.

@randomgambit
Copy link
Author

randomgambit commented Nov 9, 2016

Hi @krlmlr , I dont know about SQL backends, but providing this info after a regular merge with a regular dataframe in RStudio would be, for itself, a great first step. In my humble opinion, having control over the matching process is probably one of the most important things when analysing data.

Thanks!

@krlmlr
Copy link
Member

krlmlr commented Nov 9, 2016

@adamMaier: Actually, after looking more carefully at your implementation, I think it can be changed so that it also works for SQL backends. In some cases, computing number of matched or unmatched rows might benefit from a compute() call for SQL backends.

@hadley: Are tracked joins a good fit for dplyr, or perhaps tidyr? For data frames, we could certainly be more efficient than adding tracker columns.

@hadley
Copy link
Member

hadley commented Nov 9, 2016

@krlmlr it seems like a good idea to me.

@adamMaier
Copy link

Thanks, all. Glad to see you're considering this.

@krlmlr : Yes, you can certainly implement a more efficient approach than what I wrote with additional tracker columns. My approach was just a simple fill-in until I could figure out a smarter way to do this. (I'm coming from a longtime Stata background, so don't know much about SQL backends.) Happy to help in any way I can.

@krlmlr
Copy link
Member

krlmlr commented Nov 23, 2016

@adamMaier: Actually, I don't think it's much more than to use mutate() instead of [<-(), and perhaps count(joined, ...) to compute matched, unmatched_x and unmatched_y.

@krlmlr
Copy link
Member

krlmlr commented Dec 1, 2016

Related: #1619.

@randomgambit
Copy link
Author

randomgambit commented Dec 2, 2016

Hello there @krlmlr @hadley ! I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr ? Cannot wait to see how my merge are doing :D

@hadley hadley added database feature a feature request or enhancement labels Feb 2, 2017
@iangow
Copy link

iangow commented Mar 29, 2017

I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr ? Cannot wait to see how my merges are doing :D

In the meantime, one approach is something like this:

library(dplyr, warn.conflicts = FALSE)

df_x <- memdb_frame(a = 1:2600L, b = rep(letters, 100))

df_y <- memdb_frame(a = 1501:2500L, c = rep(LETTERS[1:10], 100))

merged <- df_x %>%
    mutate(on_x = 1L) %>%
    left_join(
        df_y %>% 
            mutate(on_y = 1L)) %>%
    mutate(on_x = coalesce(on_x, 0L),
           on_y = coalesce(on_y, 0L))
#> Joining, by = "a"

merged %>%
    count(on_x, on_y)
#> Source:   query [?? x 3]
#> Database: sqlite 3.11.1 [:memory:]
#> Groups: on_x
#> 
#>    on_x  on_y     n
#>   <int> <int> <int>
#> 1     1     0  1600
#> 2     1     1  1000

Or, using PostgreSQL (I think this would work with tibbles too).

library(dplyr, warn.conflicts = FALSE)

pg <- src_postgres()

df_x <- 
    tibble(a = 1:2600L, b = rep(letters, 100)) 

df_x <- copy_to(pg, df_x)

df_y <- tibble(a = 2001:3000L, c = rep(LETTERS[1:10], 100))

df_y <- copy_to(pg, df_y)

merged <- df_x %>%
    mutate(on_x = TRUE) %>%
    full_join(
        df_y %>% 
            mutate(on_y = TRUE)) %>%
    mutate(on_x = coalesce(on_x, FALSE),
           on_y = coalesce(on_y, FALSE))
#> Joining, by = "a"

merged %>%
    count(on_x, on_y)
#> Source:   query [?? x 3]
#> Database: postgres 9.6.2 [igow@hostname:5432/crsp]
#> Groups: on_x
#> 
#>    on_x  on_y     n
#>   <lgl> <lgl> <dbl>
#> 1 FALSE  TRUE   400
#> 2  TRUE FALSE  2000
#> 3  TRUE  TRUE   600

@randomgambit
Copy link
Author

@iangow @hadley

thanks for the suggestion. I think that works but, again, it is a pain to create fake vectors of 1s in each dataframe every time one does a merge. The fact that @hadley removed this from to_do means you are dropping this feature from future versions of dplyr?

Thanks!

@hadley
Copy link
Member

hadley commented Jun 21, 2017

No, it just means it didn't make it for this release.

@daniloimparato
Copy link

This would be a very useful feature indeed. Can't wait to see it implemented.

@adamMaier
Copy link

adamMaier commented Jan 11, 2019

@daniloimparato: I ended up putting a function to print diagnostics after implementing a dplyr join (as well as functions to print helpful information after other dplyr functions, like filter, mutate, and summarize) in a package, here. There is help documentation and a vignette in the package in case you're interested: reviewr vignette.pdf

I mostly created it for my own use and for data analysts with whom I work as I find having noisier functions helpful in the early data wrangling stages. I'm using it as a stand-in until something like this becomes a feature of dplyr.

@daniloimparato
Copy link

daniloimparato commented Jan 11, 2019

@adamMaier Sounds cool. I'll give it a try :)

For the past week I've been flirting with matthieugomez' statar::join as it additionally sports some nice key integrity checking beforehand. It bugs me that such features aren't prioritized.

@elbersb
Copy link

elbersb commented May 27, 2019

Possibly of interest: https://github.com/elbersb/tidylog
And especially this issue: elbersb/tidylog#25
Feedback welcome!

@hadley
Copy link
Member

hadley commented Dec 10, 2019

Given that tidylog has solved this problem so wonderfully, I no longer think this needs to be in dplyr — thanks @elbersb !

@hadley hadley closed this as completed Dec 10, 2019
@JackLandry
Copy link

While I appreciate the ability of tidylog to give summary information about how a merge went, the original request was for an option that would create a new variable that tells you if a given row was merged. I think that would still be really useful! For instance, I might like to subsequently subset the data to only the rows that merged for some operations. I was looking to see if this feature existed and found this thread.

@iangow
Copy link

iangow commented Dec 20, 2019

@JackLandry I thought the same thing. It might be worth asking @elbersb if the wrapper functions in tidylog could be used to that end. The code above gives a stop-gap solution.

@lock
Copy link

lock bot commented Jun 24, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement verbs 🏃‍♀️
Projects
None yet
Development

No branches or pull requests

8 participants