Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it possible to set an option to globally turn off the multiple matches warning on joins? #6717

Closed
eipi10 opened this issue Feb 10, 2023 · 17 comments · Fixed by #6753
Closed
Milestone

Comments

@eipi10
Copy link
Contributor

eipi10 commented Feb 10, 2023

dplyr 1.1 has a new argument multiple in the join functions that, by default, throws a warning if a row in x matches more than one row in y. You have to explicitly set multiple="all" in every join in order to silence the warning. According to the help, this default was chosen because multiple matches are "usually surprising." In the data I work with (education) multiple matches are the norm (students attend in multiple terms and take multiple courses in each term; instructors have multiple students, teach more than one course and perhaps multiple sections of a course, etc.). The constant warnings are really annoying, but it would likewise be annoying to have to explicitly set the multiple argument every time I do a join (including many, many joins across existing functions in our internal packages that would need to be updated).

Ideally, the default would be not to warn. If you're not open to that change, is there any way to set an option at the top of a script that disables warnings related to multiple matches, and, if not, would you be willing to add such an option?

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 13, 2023

I see that you gave some text based description of your use cases, but could you please provide a simple reprex that shows a few of them?

@eipi10
Copy link
Contributor Author

eipi10 commented Feb 13, 2023

Below is an example of a typical use case: I have data frames with (1) student data (one row per student), (2) term data (one row per student per term), and (3) course enrollment data (one row per student per term per course enrollment), and I want join the student data to the term data and then join the student+term data to the course enrollment data.

I can silence the warning if I put the course enrollment data in the x position. But why add that extra cognitive load when I already have a workflow that works the way I want it to? I also have built-in checks to warn me if something weird (like a student ID appearing twice in the same term in the term data, or a student having two enrollments in the same course in the same term) happens.

The example below is just one among many use cases where I expect multiple matches. Over the years I've worked with data in several fields and I'm struggling to find an instance where I didn't typically expect multiple matches.

library(tidyverse)

student = structure(list(
  student.id = structure(c(3L, 1L, 4L, 2L), levels = c("S1", "S2", "S3", "S4"), class = "factor"),
  cohort = c("First-time", "Transfer", "First-time", "Transfer"),
  entering.term = c("Fall 2019", "Fall 2020", "Fall 2020", "Spring 2021")),
  row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

term = structure(list(
  student.id = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
                           1L, 1L, 4L, 4L, 4L, 4L, 4L),
                         levels = c("S1", "S2", "S3", "S4"), class = "factor"),
  term = c("Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022", "Fall 2020",
           "Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022", "Fall 2020",
           "Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022", "Fall 2020",
           "Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022"),
  terms.elapsed = c(1, 2, 3, 4, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  unit.load = c(16, 18, 14, NA, 13, 15, 15, 15, 15, 16, 13, 0, 0, NA, 15, 15,
                15, 9, 0)),
  row.names = c(NA, -19L), class = c("tbl_df", "tbl", "data.frame"))

course = structure(list(
  student.id = structure(c(4L, 4L, 4L, 2L, 2L, 3L, 3L, 1L, 1L),
                         levels = c("S1", "S2", "S3", "S4"), class = "factor"),
  instructor.id = structure(c(1L, 1L, 4L, 3L, 3L, 2L, 2L, 3L, 3L),
                            levels = c("I1", "I2", "I3", "I4"), class = "factor"),
  course = c("ENGL 10", "ENGL 11", "ENGL 20", "POLS 130", "POLS 140",
             "HIST 17A", "HIST 171B", "POLS 198B", "POLS 198B"),
  class.section = c("36", "36", "18", "40", "40", "03", "01", "01", "01"),
  term = c("Fall 2020", "Spring 2021", "Spring 2022", "Fall 2021", "Fall 2021", "Fall 2020",
           "Spring 2021", "Fall 2020", "Spring 2021"),
  grade = c("A", "A-", "B-", "C-", "B", "B-", "CR", "CR", "A")),
  row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))

# One row per student
student
#> # A tibble: 4 × 3
#>   student.id cohort     entering.term
#>   <fct>      <chr>      <chr>        
#> 1 S3         First-time Fall 2019    
#> 2 S1         Transfer   Fall 2020    
#> 3 S4         First-time Fall 2020    
#> 4 S2         Transfer   Spring 2021

# One row per student per term
# (Multiple rows per student)
term
#> # A tibble: 19 × 4
#>    student.id term        terms.elapsed unit.load
#>    <fct>      <chr>               <dbl>     <dbl>
#>  1 S2         Spring 2021             1        16
#>  2 S2         Fall 2021               2        18
#>  3 S2         Spring 2022             3        14
#>  4 S2         Fall 2022               4        NA
#>  5 S3         Fall 2020               3        13
#>  6 S3         Spring 2021             4        15
#>  7 S3         Fall 2021               5        15
#>  8 S3         Spring 2022             6        15
#>  9 S3         Fall 2022               7        15
#> 10 S1         Fall 2020               1        16
#> 11 S1         Spring 2021             2        13
#> 12 S1         Fall 2021               3         0
#> 13 S1         Spring 2022             4         0
#> 14 S1         Fall 2022               5        NA
#> 15 S4         Fall 2020               1        15
#> 16 S4         Spring 2021             2        15
#> 17 S4         Fall 2021               3        15
#> 18 S4         Spring 2022             4         9
#> 19 S4         Fall 2022               5         0

# One row per course enrollment per term
# (Multiple rows for both students and instructors within and across terms)
course
#> # A tibble: 9 × 6
#>   student.id instructor.id course    class.section term        grade
#>   <fct>      <fct>         <chr>     <chr>         <chr>       <chr>
#> 1 S4         I1            ENGL 10   36            Fall 2020   A    
#> 2 S4         I1            ENGL 11   36            Spring 2021 A-   
#> 3 S4         I4            ENGL 20   18            Spring 2022 B-   
#> 4 S2         I3            POLS 130  40            Fall 2021   C-   
#> 5 S2         I3            POLS 140  40            Fall 2021   B    
#> 6 S3         I2            HIST 17A  03            Fall 2020   B-   
#> 7 S3         I2            HIST 171B 01            Spring 2021 CR   
#> 8 S1         I3            POLS 198B 01            Fall 2020   CR   
#> 9 S1         I3            POLS 198B 01            Spring 2021 A

# Gives multiple-match warnings
d1 = left_join(student, term)
#> Joining with `by = join_by(student.id)`
#> Warning in left_join(student, term): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 1 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#>   warning.
d1 = left_join(d1, course)
#> Joining with `by = join_by(student.id, term)`
#> Warning in left_join(d1, course): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 17 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#>   warning.

# No warnings when wrapped in reduce
d1a = reduce(list(student, term, course), left_join)
#> Joining with `by = join_by(student.id)`
#> Joining with `by = join_by(student.id, term)`

# Reverse order of join data frames to stop warning
d2 = left_join(term, student)
#> Joining with `by = join_by(student.id)`
d2 = left_join(course, d2)
#> Joining with `by = join_by(student.id, term)`

# Right join also gives multiple-match warning
d3 = right_join(student, term)
#> Joining with `by = join_by(student.id)`
#> Warning in right_join(student, term): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 1 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#>   warning.

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.0     purrr_1.0.1    
#> [5] readr_2.1.3     tidyr_1.3.0     tibble_3.1.8    ggplot2_3.4.0  
#> [9] tidyverse_1.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.9.0        tidyselect_1.2.0    xfun_0.37          
#>  [4] haven_2.5.1         gargle_1.3.0        colorspace_2.1-0   
#>  [7] vctrs_0.5.2         generics_0.1.3      htmltools_0.5.4    
#> [10] yaml_2.3.7          utf8_1.2.3          rlang_1.0.6        
#> [13] R.oo_1.25.0         pillar_1.8.1        glue_1.6.2         
#> [16] withr_2.5.0         DBI_1.1.3           R.utils_2.12.2     
#> [19] dbplyr_2.3.0        readxl_1.4.1        modelr_0.1.10      
#> [22] R.cache_0.16.0      lifecycle_1.0.3     munsell_0.5.0      
#> [25] gtable_0.3.1        cellranger_1.1.0    rvest_1.0.3        
#> [28] R.methodsS3_1.8.2   evaluate_0.20       knitr_1.42         
#> [31] tzdb_0.3.0          fastmap_1.1.0       fansi_1.0.4        
#> [34] broom_1.0.3         backports_1.4.1     scales_1.2.1       
#> [37] googlesheets4_1.0.1 jsonlite_1.8.4      fs_1.6.0           
#> [40] hms_1.1.2           digest_0.6.31       stringi_1.7.12     
#> [43] grid_4.2.2          cli_3.6.0           tools_4.2.2        
#> [46] magrittr_2.0.3      crayon_1.5.2        pkgconfig_2.0.3    
#> [49] ellipsis_0.3.2      xml2_1.3.3          timechange_0.2.0   
#> [52] reprex_2.0.2        googledrive_2.0.0   lubridate_1.9.1    
#> [55] assertthat_0.2.1    rmarkdown_2.20      httr_1.4.4         
#> [58] rstudioapi_0.14     R6_2.5.1            compiler_4.2.2

Created on 2023-02-13 with reprex v2.0.2

@carbo82
Copy link

carbo82 commented Feb 15, 2023

I also find this warning unnecessary. The left_join() function is quite explicit that it is potentially matching more than one rows.

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 15, 2023

@carbo82 we actually found and believe that many people do not think so https://twitter.com/hadleywickham/status/1435331147978903558?s=20
https://twitter.com/hadleywickham/status/1435952016224886784?s=20

@joranE
Copy link
Contributor

joranE commented Feb 15, 2023

fwiw, I also think the default on this is wrong. I think allowing the ability to warn on multiple matches in a join is great, but the default should be no warning, and users should be able to request warnings when they want them.

One-to-many joins are so common that warning on that by default is in my opinion excessive.

I could see an argument for warning by default in inner_join() where I think you could reasonably argue that the most common expectation is a 1-1 join, but I think 1-1 joins are very far from the most common use for left_join() or right_join(). One of the most common database design patterns is to have small fact tables with long descriptions for a set of values, and those long descriptions are added to results via a 1-many left join. Having to silence warnings all the time for that kind of use will get tiresome pretty quickly.

@eipi10
Copy link
Contributor Author

eipi10 commented Feb 15, 2023

@carbo82 we actually found and believe that many people do not think so https://twitter.com/hadleywickham/status/1435331147978903558?s=20 https://twitter.com/hadleywickham/status/1435952016224886784?s=20

You changed a consequential default based on what some folks who happened to be on Twitter and see Hadley's tweet in early September 2021 told you?

@joranE
Copy link
Contributor

joranE commented Feb 15, 2023

I think it's great that Hadley periodically asks for feedback on social media to get a broader sense of user's expectations and use cases.

In this particular instance, I'm skeptical that people's expectations around the behavior of inner_join() tell us much about what the right defaults should be for left_join() or right_join().

@eipi10
Copy link
Contributor Author

eipi10 commented Feb 15, 2023

It is indeed great that the Tidyverse team solicits feedback from users. It's not so great when they make a consequential decision based on the assumption that a convenience sample of Twitter users is representative of frequent users of ***_join functions or of join use cases.

Furthermore, there was never a need to impose a one-size-fits-all change that's bound to inconvenience a substantial number of Tidyverse users. If the multiple argument is here to stay, it seems like a global option to set the multiple default makes the most sense here. That way, those who find it beneficial can have it TRUE and those who find it annoying and obtrusive (either put up with frequent warnings or explicitly set multiple=FALSE every time) can set it to FALSE once in their .rprofile or at the top of a script.

@eipi10
Copy link
Contributor Author

eipi10 commented Feb 16, 2023

Thanks for taking another look at this Davis and for your thoughtful discussion in #6731.

@DavisVaughan
Copy link
Member

@eipi10 you brought up some good examples, so it was worth revisiting. Does the proposal in #6731 sound reasonable to you? After thinking about all of this some more, to me it seems much more reasonable and easier to justify.

@hadley
Copy link
Member

hadley commented Feb 16, 2023

@eipi10 Obviously, we don't make decisions solely on the basis of twitter polls (but they were a very useful tool for roughly gauging how the community feels about things). Since we were also worried about this change, we described it in pre-release blog post. We didn't hear any negative feedback prior to release, so it's not clear what we could have done differently.

@eipi10
Copy link
Contributor Author

eipi10 commented Feb 23, 2023

The new relationship argument described in #6753 seems like a nice approach. Thanks again for taking a second look at how to deal with multiple matches.

@philibe
Copy link

philibe commented Apr 19, 2023

Like @eipi10 and @carbo82 and @joranE , I also find this warnings are unnecessary especially if there are setted by default.

So I think for shiny and batches for some cases I will use this code to catch only some warnings but not all. Only for the warning message of dplyr:::rethrow_warning_join_relationship_many_to_many() at the beginning.

Tools.dplyr_suppressWarnings<- function (fct, ...) {
  tryCatch(
    withCallingHandlers (
      {
        fct(...)
      }, warning=function(w) {

        testw_rethrow_warning_join_relationship_many_to_many <-any(grepl("Detected an unexpected many-to-many relationship",w) )        
        testw_rethrow_warning_summarise_use_reframe <-any(grepl("Returning more \\(or less\\) than 1 row per .summarise",w) )
        
        if (testw_rethrow_warning_join_relationship_many_to_many ||  testw_rethrow_warning_summarise_use_reframe) {
          invokeRestart("muffleWarning")
        }    else {
          conditionMessage(w)
        }
      }),
    error = function(e) {
      print(e)
    }
  )
}

Tools.dplyr_inner_join<- function (...) {
  Tools.dplyr_suppressWarnings (dplyr::inner_join, ...)
}

Tools.dplyr_left_join<- function (...) {
  Tools.dplyr_suppressWarnings (dplyr::left_join, ...)
}
# etc

Tools.dplyr_summarise<- function (...) {
  Tools.dplyr_suppressWarnings (dplyr::summarise, ...)
}

df1 <- data.frame(x = c(1, 1, 2, 3), y = 1:4)

df2 <- data.frame(x = c(1, 1, 2), z = 1:3)

dplyr::inner_join(df1, df2)

#> Joining with `by = join_by(x)`
#> Warning in dplyr::inner_join(df1, df2): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 1 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#>   x y z
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 3 3

Tools.dplyr_inner_join(df1, df2)
#> Joining with `by = join_by(x)`
#>   x y z
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 3 3

Tools.dplyr_inner_join(df1, df2,by=c("x"))
#>   x y z
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 3 3

Created on 2023-04-19 with reprex v2.0.2

@DavisVaughan
Copy link
Member

If you really need a many-to-many join, why not just specify that as the relationship to silence the warning? dplyr::inner_join(df1, df2, relationship = "many-to-many")

@philibe
Copy link

philibe commented Apr 19, 2023

The cross joined datas are indeed a real concern and lot of issues of datas extracting are caused by unwanted cross joins.
I know this problem because I come from RDBMS and BI ecosystem. But when I write a SQL query, like a dplyr query, I know what I do.

For interactive usage or small app, these warnings are very useful. But for big apps and batches, I don't want to verify all of my old dplyr queries. And for some queries I've created cross joined datas to generate multiple lines.

And I think it's not the same problem than in Rust Language where they don't want memory leak, and they are therefore very very strict, but it's compiled language. For queries it's "only" business logic, and "only" warnings.

Note that I capture only dplyr:::rethrow_warning_join_relationship_many_to_many() but I want the others warnings. I will capture case-by-case.

While I am on the subject of new release :) , IMHO I think for a simple minor version there are too many deprecated features and too many supernumerary warnings for this dplyr 1.1.0 version :)

@philibe
Copy link

philibe commented Apr 19, 2023

The warnings are very very useful for log files. I look at them very often.

But if from one day to the next these log files are full of noise warnings, they could mask the potential real issues by their volumes.

It's because I think warnings are useful, that I want to have the choice to not see some of them, especially if they over-informe me. :)

@philibe
Copy link

philibe commented Apr 20, 2023

A little thing about multiple matches but in reframe()/summarise().

TL;DR

Could dplyr message suggest, in addition of this message, to also see group_by() ?

Detailed explanation and opinion

I didn't understand the message below until I look at the summarise feature with RDBMS point of View.

Returning more (or less) than 1 row per summarise() group was deprecated in dplyr 1.1.0. Please use reframe() instead"

In SQL we have to use GROUP BY for SUM() so one part of the warning about returning more than 1 row per summarise() could be to also suggest to group_by() :) .

In other terms, we have two possibilities depending on what the user wants, and not only reframe() --- with df1 <- data.frame( x = c(1, 1, 2, 3),y = c(1L, 2L, 3L, 4L)).

  • without one row by group: df1 %>% reframe(z = x + y)
  • with group before the summarise if we want to group. IMHO it's also a normal usage and has to be suggested if you want to suggest something
    • df1 %>% summarise( z = x + y, .by=c("x","y"))
    • df1 %>% group_by(x,y) %>%summarise( z = x + y, .groups= "drop_last")
A little tweak for fun

No no I will not use except for very very special case because it's normal to stop without rows, or to test if there is no rows, and we have to be aware of implicit group or to use reframe().

  • the tweak : df1 %>% (function (x) { if (nrow(x)==0) add_row(x) else x})(.) %>% rowwise(.) %>% summarise( z = x + y).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants