-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature request : add merge
indicator after a merge in dplyr
#2183
Comments
Stata similarly creates a new variable I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable For the last few months I've been using basic functions I wrote that simply embellish the full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
.merge = FALSE, ...){
# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
stop("Variable .merge already exists; change name before proceeding")
}
# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1
# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix, ...)
# Calculating merge diagnoses
matched <- joined %>%
filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
NROW()
unmatched_x <- joined %>%
filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
NROW()
unmatched_y <- joined %>%
filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
NROW()
# Print merge diagnoses
message(
unmatched_x, " Rows ONLY from left data frame", "\n",
unmatched_y, " Rows ONLY from right data frame", "\n",
matched, " Rows matched"
)
# Create .merge variable if specified
if(.merge){
joined <- joined %>%
mutate(.merge =
case_when(
!is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
TRUE ~ "matched"
)
)
}
# Dropping tracker variables and returning data frame
joined <- joined %>%
select(-.x_tracker, -.y_tracker)
return(joined)
} |
hello everyone! just a follow up, is anyone interested in coding that up one day? I think this would be a really useful feature thanks!! |
I'm still wondering how to implement this for SQL backends. |
Hi @krlmlr , I dont know about SQL backends, but providing this info after a regular merge with a regular dataframe in Thanks! |
@adamMaier: Actually, after looking more carefully at your implementation, I think it can be changed so that it also works for SQL backends. In some cases, computing number of matched or unmatched rows might benefit from a compute() call for SQL backends. @hadley: Are tracked joins a good fit for dplyr, or perhaps tidyr? For data frames, we could certainly be more efficient than adding tracker columns. |
@krlmlr it seems like a good idea to me. |
Thanks, all. Glad to see you're considering this. @krlmlr : Yes, you can certainly implement a more efficient approach than what I wrote with additional tracker columns. My approach was just a simple fill-in until I could figure out a smarter way to do this. (I'm coming from a longtime Stata background, so don't know much about SQL backends.) Happy to help in any way I can. |
@adamMaier: Actually, I don't think it's much more than to use |
Related: #1619. |
In the meantime, one approach is something like this: library(dplyr, warn.conflicts = FALSE)
df_x <- memdb_frame(a = 1:2600L, b = rep(letters, 100))
df_y <- memdb_frame(a = 1501:2500L, c = rep(LETTERS[1:10], 100))
merged <- df_x %>%
mutate(on_x = 1L) %>%
left_join(
df_y %>%
mutate(on_y = 1L)) %>%
mutate(on_x = coalesce(on_x, 0L),
on_y = coalesce(on_y, 0L))
#> Joining, by = "a"
merged %>%
count(on_x, on_y)
#> Source: query [?? x 3]
#> Database: sqlite 3.11.1 [:memory:]
#> Groups: on_x
#>
#> on_x on_y n
#> <int> <int> <int>
#> 1 1 0 1600
#> 2 1 1 1000 Or, using PostgreSQL (I think this would work with tibbles too). library(dplyr, warn.conflicts = FALSE)
pg <- src_postgres()
df_x <-
tibble(a = 1:2600L, b = rep(letters, 100))
df_x <- copy_to(pg, df_x)
df_y <- tibble(a = 2001:3000L, c = rep(LETTERS[1:10], 100))
df_y <- copy_to(pg, df_y)
merged <- df_x %>%
mutate(on_x = TRUE) %>%
full_join(
df_y %>%
mutate(on_y = TRUE)) %>%
mutate(on_x = coalesce(on_x, FALSE),
on_y = coalesce(on_y, FALSE))
#> Joining, by = "a"
merged %>%
count(on_x, on_y)
#> Source: query [?? x 3]
#> Database: postgres 9.6.2 [igow@hostname:5432/crsp]
#> Groups: on_x
#>
#> on_x on_y n
#> <lgl> <lgl> <dbl>
#> 1 FALSE TRUE 400
#> 2 TRUE FALSE 2000
#> 3 TRUE TRUE 600 |
No, it just means it didn't make it for this release. |
This would be a very useful feature indeed. Can't wait to see it implemented. |
@daniloimparato: I ended up putting a function to print diagnostics after implementing a dplyr join (as well as functions to print helpful information after other dplyr functions, like filter, mutate, and summarize) in a package, here. There is help documentation and a vignette in the package in case you're interested: reviewr vignette.pdf I mostly created it for my own use and for data analysts with whom I work as I find having noisier functions helpful in the early data wrangling stages. I'm using it as a stand-in until something like this becomes a feature of |
@adamMaier Sounds cool. I'll give it a try :) For the past week I've been flirting with matthieugomez' statar::join as it additionally sports some nice key integrity checking beforehand. It bugs me that such features aren't prioritized. |
Possibly of interest: https://github.com/elbersb/tidylog |
Given that tidylog has solved this problem so wonderfully, I no longer think this needs to be in dplyr — thanks @elbersb ! |
While I appreciate the ability of tidylog to give summary information about how a merge went, the original request was for an option that would create a new variable that tells you if a given row was merged. I think that would still be really useful! For instance, I might like to subsequently subset the data to only the rows that merged for some operations. I was looking to see if this feature existed and found this thread. |
@JackLandry I thought the same thing. It might be worth asking @elbersb if the wrapper functions in |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
Hello there,
Congrats for the great work here! I have a suggestion to make.
Is there a way to get the equivalent of a
_merge
indicator variable after a merge inDplyr
?Something similar to Pandas'
indicator = True
option that essentially tells you how the merge went (how many matches from each dataset, etc).Here is an example in
Pandas
Here, after a
left join
betweendf1
anddf2
, you want to immediately know how many rows indf1
found a match indf2
and how many of them did notand I can tabulate this
merge
variable:I don't see any option available after a, say, left join in
Dplyr
Am I missing something here?
Originally posted on SO http://stackoverflow.com/questions/40110644/dplyr-is-there-a-merge-indicator-available-after-a-merge.
The solution provided by the user is nice, but I believe this should be be an option implemented directly in
dplyr
merge. Having control over how the match went is fundamental.What do you think?
Thanks!
The text was updated successfully, but these errors were encountered: