[Feature Request] Join Information added to dplyr joins functions #3231

Closed
GeorgeRJacobs opened this issue Nov 29, 2017 · 11 comments
Labels
feature a feature request or enhancement verbs 🏃‍♀️

Comments

@GeorgeRJacobs

GeorgeRJacobs commented Nov 29, 2017

Perhaps this is asking too much of the package, but I find that when I am doing data joins and the like in R, there can sometimes be unexpected behavior, such as unaccounted-for rows being added by left_join(). Some sort of helpful output would be ideal for fact-checking myself, along the lines of:

df before: 10 rows

# Oops - this wasn't supposed to create new rows! 
df after: 15 rows

I usually do this anyway (nrow(), etc.), but it might be helpful to display some 'metrics'. It might be especially helpful for new users starting with joins in R to have some useful output to go along with the mental work of reasoning about join conditions.
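To make the request concrete, here is a minimal sketch of the kind of output described above. The helper name `left_join_counted` and the example data are hypothetical, not part of dplyr; it only wraps `left_join()` and reports the row counts before and after.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): run a left join and report row counts.
left_join_counted <- function(x, y, by) {
  out <- left_join(x, y, by = by)
  message("df before: ", nrow(x), " rows")
  message("df after: ", nrow(out), " rows")
  if (nrow(out) > nrow(x)) {
    # A left join only grows when some key of x matches several rows of y
    message("Note: the join added ", nrow(out) - nrow(x), " row(s)")
  }
  out
}

orders <- tibble(id = c(1, 2, 3), item = c("a", "b", "c"))
# id 2 appears twice here, so the matching row of `orders` is duplicated
prices <- tibble(id = c(1, 2, 2), price = c(10, 20, 25))

joined <- left_join_counted(orders, prices, by = "id")
```

Using `message()` rather than `print()` keeps the report on stderr, so it can be silenced with `suppressMessages()` when it is not wanted.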

Thanks

@batpigandme
Contributor

batpigandme commented Nov 29, 2017

I haven't gone through all of these packages, but I recently came across @echasnovski's ruler package, which has a nice list of some ~tidy data validation tools at the end
https://github.com/echasnovski/ruler

@GeorgeRJacobs
Author

Thanks for the link! From a quick look at the vignettes, it appears that the package takes the ideas above and pushes them to their logical conclusion. It would still require new students of R to put some thought into joins and expected output (probably a good thing!). Depending on the goals of the project, some 'gentle' output may still be warranted. Regardless, I will start taking a deeper look.

@batpigandme
Contributor

I'm all for helpful error messages (I'll be curious to see what others think). My major concern would be the meaning that would come from the "mute sign" (i.e. if there's no message, you assume everything's fine).

My best analogy for this is that, for a long time, none of the four-way intersections on the roads around me had stop signs (we're talking 'burbs, not high traffic). You'd stop (or slow), and look, because there could be cars or kids coming. Then, a few years back, a family moved into a house nearby and somehow got stop signs approved for the three-way intersection in front of it (four if you count their driveway as a road). It drives me crazy, because, as a result, you'd assume you have the right of way/can safely go when there's no stop sign at the rest of the intersections on the road.

Sorry for the long-winded explanation!

@GeorgeRJacobs
Author

GeorgeRJacobs commented Nov 29, 2017

No worries. I am smarter for it. So if I understand correctly, you are worried about the case where a mute sign occurs, and the person says in their head, "Join = Good. No more critical thinking required." I think the question turns on whether this sort of message saves users time by identifying unexpected behavior, or instead changes a user's expectation of the 'safety' of their join operation. It may be that this shift in expectations is too much, and therefore no change should be implemented. In that case we will have to rely on packages like the one above to solve this problem.

@batpigandme
Contributor

It's not so much that it'd be bad to have backups in place (again, I'm definitely curious to hear what the devs think); it's just that outside of the ecosystem of a single package, there's no way to control for the same scenario. Likewise, one would want to consider what comparable failsafes would be. Surely there are similar scenarios with other types of joins; in fact, an infinite number of things could go wrong (obviously that's an absurd line of argument), and you'd want to make sure that there's an even-handed expectation of validation. Data quality and validation are broad areas (it's actually what my dad did for years), and I really like the idea of packages that work well with "tidy data," but I'm not sure that this would be in scope for the join function, if we're working off of the ethos that one function does one thing and does it well.

Again, I could be totally wrong on this. Just my 2¢s atm! 😉

@GeorgeRJacobs
Author

I think I see your point. Hmm. Maybe you are right about this. If the join operations are working on tidy data, then what I propose is a lesser issue. And if the data isn't tidy, you most likely have more than just this specific problem to deal with. Following that logic, dplyr's ethos should be to provide tools for manipulating data using a common set of verbs, not to suggest ways to manipulate said data.

@krlmlr
Member

krlmlr commented Mar 16, 2018

Thanks. I see some overlap with #2183 and #1792, can you confirm?

@krlmlr krlmlr added feature a feature request or enhancement data frame verbs 🏃‍♀️ labels Mar 16, 2018
@brshallo

brshallo commented Nov 2, 2018

@GeorgeRJacobs I agree that having an argument/option to print some “soft” metrics with the *_join functions might be nice. @batpigandme I hear your point regarding the “tidy data” ethos, but unnoticed join errors happen so frequently that an exception here might not be bad. E.g., it could print “how many (/any) rows duplicated from each dataframe”, “how many rows remain from each”… or some individual or short list of “foundational join quality metric(s)” …

Here’s a rough and dirty example:

library(tidyverse, warn.conflicts = FALSE, quietly = TRUE)

df1 <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3",
  6, "x4"
)

df2 <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  3, "y3",
  4, "y4",
  5, "y5"
)

inner_join_noisy <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), noisy = TRUE, ...){
  
  # QUICK example of a "quality metric" to print for `inner_join`
  
  # `noisy` is consumed by this wrapper and must not be forwarded to inner_join()
  output <- inner_join(x, y, by = by, copy = copy, suffix = suffix, ...)
  
  print_num_rows <- function(x_or_y){
    input_name <- deparse(substitute(x_or_y))
    
    x_or_y %>% 
      semi_join(output, by = "key") %>% 
      count() %>% 
      with(n) %>% 
      cat(., " of ", nrow(x_or_y), " individual rows from ", input_name, " included in output.", "\n")
  }
  
  if(noisy){
    print_num_rows(x_or_y = x)
    print_num_rows(x_or_y = y)
  }
  
  output
}

inner_join_noisy(df1, df2, by = "key", noisy = TRUE)
#> 3  of  4  individual rows from  x  included in output. 
#> 3  of  5  individual rows from  y  included in output.
#> # A tibble: 3 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     3 x3    y3

(P.S. not advocating for GUI ETL tools here – though I did see a colleague using a BI tool that had some metrics like this pop up when specifying a join and thought it seemed like a nifty feature – which led me here.)
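The other metrics suggested above (duplicated rows, rows that do not survive the join) could be sketched along the same lines. The helper name `join_diagnostics` is hypothetical and not part of dplyr; it assumes `by` is a plain character vector of column names shared by both inputs.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): summarise key quality before joining.
# Assumes `by` is an unnamed character vector of columns present in both inputs.
join_diagnostics <- function(x, y, by) {
  list(
    dup_keys_x  = sum(duplicated(x[by])),          # rows of x repeating an earlier key
    dup_keys_y  = sum(duplicated(y[by])),          # rows of y repeating an earlier key
    unmatched_x = nrow(anti_join(x, y, by = by)),  # rows of x with no match in y
    unmatched_y = nrow(anti_join(y, x, by = by))   # rows of y with no match in x
  )
}

df1 <- tibble(key = c(1, 2, 3, 6), val_x = c("x1", "x2", "x3", "x4"))
df2 <- tibble(key = c(1, 2, 3, 4, 5), val_y = c("y1", "y2", "y3", "y4", "y5"))

metrics <- join_diagnostics(df1, df2, by = "key")
str(metrics)
```

With these inputs, one row of `df1` (key 6) and two rows of `df2` (keys 4 and 5) have no match, and neither input has duplicated keys.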

@brshallo

It looks like @elbersb's tidylog package has implemented this type of information across joins and other dplyr verbs.
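For reference, a minimal sketch of what using it looks like, assuming tidylog is installed from CRAN. The data frames are made up for illustration, and the exact wording of the printed message depends on the tidylog version, so it is not reproduced here.

```r
library(dplyr)

df1 <- tibble(key = c(1, 2, 3, 6), val_x = c("x1", "x2", "x3", "x4"))
df2 <- tibble(key = c(1, 2, 3, 4, 5), val_y = c("y1", "y2", "y3", "y4", "y5"))

# Call the tidylog wrapper explicitly: it runs dplyr::left_join() and
# prints a message summarising what the join did.
joined <- tidylog::left_join(df1, df2, by = "key")

# Alternatively, attach tidylog after dplyr so its wrappers mask the dplyr verbs:
# library(tidylog, warn.conflicts = FALSE)
```

Calling the wrapper with the `tidylog::` prefix keeps the logging opt-in per call, which sidesteps the "mute sign" concern raised earlier: a silent join is then simply one that was not logged.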

@hadley
Member

hadley commented May 27, 2019

Duplicate of #2183, and I'd definitely recommend tidylog.

@hadley hadley closed this as completed May 27, 2019
@lock

lock bot commented Nov 23, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 23, 2019