-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is it possible to set an option to globally turn off the multiple matches warning on joins? #6717
Comments
I see that you gave some text based description of your use cases, but could you please provide a simple reprex that shows a few of them? |
Below is an example of a typical use case: I have data frames with (1) student data (one row per student), (2) term data (one row per student per term), and (3) course enrollment data (one row per student per term per course enrollment), and I want join the student data to the term data and then join the student+term data to the course enrollment data. I can silence the warning if I put the course enrollment data in the The example below is just one among many use cases where I expect multiple matches. Over the years I've worked with data in several fields and I'm struggling to find an instance where I didn't typically expect multiple matches. library(tidyverse)
student = structure(list(
student.id = structure(c(3L, 1L, 4L, 2L), levels = c("S1", "S2", "S3", "S4"), class = "factor"),
cohort = c("First-time", "Transfer", "First-time", "Transfer"),
entering.term = c("Fall 2019", "Fall 2020", "Fall 2020", "Spring 2021")),
row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
term = structure(list(
student.id = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 1L, 4L, 4L, 4L, 4L, 4L),
levels = c("S1", "S2", "S3", "S4"), class = "factor"),
term = c("Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022", "Fall 2020",
"Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022", "Fall 2020",
"Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022", "Fall 2020",
"Spring 2021", "Fall 2021", "Spring 2022", "Fall 2022"),
terms.elapsed = c(1, 2, 3, 4, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
unit.load = c(16, 18, 14, NA, 13, 15, 15, 15, 15, 16, 13, 0, 0, NA, 15, 15,
15, 9, 0)),
row.names = c(NA, -19L), class = c("tbl_df", "tbl", "data.frame"))
course = structure(list(
student.id = structure(c(4L, 4L, 4L, 2L, 2L, 3L, 3L, 1L, 1L),
levels = c("S1", "S2", "S3", "S4"), class = "factor"),
instructor.id = structure(c(1L, 1L, 4L, 3L, 3L, 2L, 2L, 3L, 3L),
levels = c("I1", "I2", "I3", "I4"), class = "factor"),
course = c("ENGL 10", "ENGL 11", "ENGL 20", "POLS 130", "POLS 140",
"HIST 17A", "HIST 171B", "POLS 198B", "POLS 198B"),
class.section = c("36", "36", "18", "40", "40", "03", "01", "01", "01"),
term = c("Fall 2020", "Spring 2021", "Spring 2022", "Fall 2021", "Fall 2021", "Fall 2020",
"Spring 2021", "Fall 2020", "Spring 2021"),
grade = c("A", "A-", "B-", "C-", "B", "B-", "CR", "CR", "A")),
row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
# One row per student
student
#> # A tibble: 4 × 3
#> student.id cohort entering.term
#> <fct> <chr> <chr>
#> 1 S3 First-time Fall 2019
#> 2 S1 Transfer Fall 2020
#> 3 S4 First-time Fall 2020
#> 4 S2 Transfer Spring 2021
# One row per student per term
# (Multiple rows per student)
term
#> # A tibble: 19 × 4
#> student.id term terms.elapsed unit.load
#> <fct> <chr> <dbl> <dbl>
#> 1 S2 Spring 2021 1 16
#> 2 S2 Fall 2021 2 18
#> 3 S2 Spring 2022 3 14
#> 4 S2 Fall 2022 4 NA
#> 5 S3 Fall 2020 3 13
#> 6 S3 Spring 2021 4 15
#> 7 S3 Fall 2021 5 15
#> 8 S3 Spring 2022 6 15
#> 9 S3 Fall 2022 7 15
#> 10 S1 Fall 2020 1 16
#> 11 S1 Spring 2021 2 13
#> 12 S1 Fall 2021 3 0
#> 13 S1 Spring 2022 4 0
#> 14 S1 Fall 2022 5 NA
#> 15 S4 Fall 2020 1 15
#> 16 S4 Spring 2021 2 15
#> 17 S4 Fall 2021 3 15
#> 18 S4 Spring 2022 4 9
#> 19 S4 Fall 2022 5 0
# One row per course enrollment per term
# (Multiple rows for both students and instructors within and across terms)
course
#> # A tibble: 9 × 6
#> student.id instructor.id course class.section term grade
#> <fct> <fct> <chr> <chr> <chr> <chr>
#> 1 S4 I1 ENGL 10 36 Fall 2020 A
#> 2 S4 I1 ENGL 11 36 Spring 2021 A-
#> 3 S4 I4 ENGL 20 18 Spring 2022 B-
#> 4 S2 I3 POLS 130 40 Fall 2021 C-
#> 5 S2 I3 POLS 140 40 Fall 2021 B
#> 6 S3 I2 HIST 17A 03 Fall 2020 B-
#> 7 S3 I2 HIST 171B 01 Spring 2021 CR
#> 8 S1 I3 POLS 198B 01 Fall 2020 CR
#> 9 S1 I3 POLS 198B 01 Spring 2021 A
# Gives multiple-match warnings
d1 = left_join(student, term)
#> Joining with `by = join_by(student.id)`
#> Warning in left_join(student, term): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 1 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#> warning.
d1 = left_join(d1, course)
#> Joining with `by = join_by(student.id, term)`
#> Warning in left_join(d1, course): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 17 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#> warning.
# No warnings when wrapped in reduce
d1a = reduce(list(student, term, course), left_join)
#> Joining with `by = join_by(student.id)`
#> Joining with `by = join_by(student.id, term)`
# Reverse order of join data frames to stop warning
d2 = left_join(term, student)
#> Joining with `by = join_by(student.id)`
d2 = left_join(course, d2)
#> Joining with `by = join_by(student.id, term)`
# Right join also gives multiple-match warning
d3 = right_join(student, term)
#> Joining with `by = join_by(student.id)`
#> Warning in right_join(student, term): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 1 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#> warning.
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.0 purrr_1.0.1
#> [5] readr_2.1.3 tidyr_1.3.0 tibble_3.1.8 ggplot2_3.4.0
#> [9] tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.9.0 tidyselect_1.2.0 xfun_0.37
#> [4] haven_2.5.1 gargle_1.3.0 colorspace_2.1-0
#> [7] vctrs_0.5.2 generics_0.1.3 htmltools_0.5.4
#> [10] yaml_2.3.7 utf8_1.2.3 rlang_1.0.6
#> [13] R.oo_1.25.0 pillar_1.8.1 glue_1.6.2
#> [16] withr_2.5.0 DBI_1.1.3 R.utils_2.12.2
#> [19] dbplyr_2.3.0 readxl_1.4.1 modelr_0.1.10
#> [22] R.cache_0.16.0 lifecycle_1.0.3 munsell_0.5.0
#> [25] gtable_0.3.1 cellranger_1.1.0 rvest_1.0.3
#> [28] R.methodsS3_1.8.2 evaluate_0.20 knitr_1.42
#> [31] tzdb_0.3.0 fastmap_1.1.0 fansi_1.0.4
#> [34] broom_1.0.3 backports_1.4.1 scales_1.2.1
#> [37] googlesheets4_1.0.1 jsonlite_1.8.4 fs_1.6.0
#> [40] hms_1.1.2 digest_0.6.31 stringi_1.7.12
#> [43] grid_4.2.2 cli_3.6.0 tools_4.2.2
#> [46] magrittr_2.0.3 crayon_1.5.2 pkgconfig_2.0.3
#> [49] ellipsis_0.3.2 xml2_1.3.3 timechange_0.2.0
#> [52] reprex_2.0.2 googledrive_2.0.0 lubridate_1.9.1
#> [55] assertthat_0.2.1 rmarkdown_2.20 httr_1.4.4
#> [58] rstudioapi_0.14 R6_2.5.1 compiler_4.2.2 Created on 2023-02-13 with reprex v2.0.2 |
I also find this warning unnecessary. The left_join() function is quite explicit that it is potentially matching more than one rows. |
@carbo82 we actually found and believe that many people do not think so https://twitter.com/hadleywickham/status/1435331147978903558?s=20 |
fwiw, I also think the default on this is wrong. I think allowing the ability to warn on multiple matches in a join is great, but the default should be no warning, and users should be able to request warnings when they want them. One-to-many joins are so common that warning on that by default is in my opinion excessive. I could see an argument for warning by default in |
You changed a consequential default based on what some folks who happened to be on Twitter and see Hadley's tweet in early September 2021 told you? |
I think it's great that Hadley periodically asks for feedback on social media to get a broader sense of user's expectations and use cases. In this particular instance, I'm skeptical that people's expectations around the behavior of |
It is indeed great that the Tidyverse team solicits feedback from users. It's not so great when they make a consequential decision based on the assumption that a convenience sample of Twitter users is representative of frequent users of Furthermore, there was never a need to impose a one-size-fits-all change that's bound to inconvenience a substantial number of Tidyverse users. If the |
Thanks for taking another look at this Davis and for your thoughtful discussion in #6731. |
@eipi10 Obviously, we don't make decisions solely on the basis of twitter polls (but they were a very useful tool for roughly gauging how the community feels about things). Since we were also worried about this change, we described it in pre-release blog post. We didn't hear any negative feedback prior to release, so it's not clear what we could have done differently. |
The new |
Like @eipi10 and @carbo82 and @joranE , I also find this warnings are unnecessary especially if there are setted by default. So I think for shiny and batches for some cases I will use this code to catch only some warnings but not all. Only for the warning message of Tools.dplyr_suppressWarnings<- function (fct, ...) {
tryCatch(
withCallingHandlers (
{
fct(...)
}, warning=function(w) {
testw_rethrow_warning_join_relationship_many_to_many <-any(grepl("Detected an unexpected many-to-many relationship",w) )
testw_rethrow_warning_summarise_use_reframe <-any(grepl("Returning more \\(or less\\) than 1 row per .summarise",w) )
if (testw_rethrow_warning_join_relationship_many_to_many || testw_rethrow_warning_summarise_use_reframe) {
invokeRestart("muffleWarning")
} else {
conditionMessage(w)
}
}),
error = function(e) {
print(e)
}
)
}
Tools.dplyr_inner_join<- function (...) {
Tools.dplyr_suppressWarnings (dplyr::inner_join, ...)
}
Tools.dplyr_left_join<- function (...) {
Tools.dplyr_suppressWarnings (dplyr::left_join, ...)
}
# etc
Tools.dplyr_summarise<- function (...) {
Tools.dplyr_suppressWarnings (dplyr::summarise, ...)
}
df1 <- data.frame(x = c(1, 1, 2, 3), y = 1:4)
df2 <- data.frame(x = c(1, 1, 2), z = 1:3)
dplyr::inner_join(df1, df2)
#> Joining with `by = join_by(x)`
#> Warning in dplyr::inner_join(df1, df2): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 1 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#> "many-to-many"` to silence this warning.
#> x y z
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 3 3
Tools.dplyr_inner_join(df1, df2)
#> Joining with `by = join_by(x)`
#> x y z
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 3 3
Tools.dplyr_inner_join(df1, df2,by=c("x"))
#> x y z
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 3 3 Created on 2023-04-19 with reprex v2.0.2 |
If you really need a many-to-many join, why not just specify that as the |
The cross joined datas are indeed a real concern and lot of issues of datas extracting are caused by unwanted cross joins. For interactive usage or small app, these warnings are very useful. But for big apps and batches, I don't want to verify all of my old dplyr queries. And for some queries I've created cross joined datas to generate multiple lines. And I think it's not the same problem than in Rust Language where they don't want memory leak, and they are therefore very very strict, but it's compiled language. For queries it's "only" business logic, and "only" warnings. Note that I capture only While I am on the subject of new release :) , IMHO I think for a simple minor version there are too many deprecated features and too many supernumerary warnings for this dplyr 1.1.0 version :) |
The warnings are very very useful for log files. I look at them very often. But if from one day to the next these log files are full of noise warnings, they could mask the potential real issues by their volumes. It's because I think warnings are useful, that I want to have the choice to not see some of them, especially if they over-informe me. :) |
A little thing about multiple matches but in TL;DRCould dplyr message suggest, in addition of this message, to also see Detailed explanation and opinionI didn't understand the message below until I look at the summarise feature with RDBMS point of View.
In SQL we have to use In other terms, we have two possibilities depending on what the user wants, and not only
A little tweak for funNo no I will not use except for very very special case because it's normal to stop without rows, or to test if there is no rows, and we have to be aware of implicit group or to use
|
dplyr
1.1 has a new argumentmultiple
in the join functions that, by default, throws a warning if a row inx
matches more than one row iny
. You have to explicitly setmultiple="all"
in every join in order to silence the warning. According to the help, this default was chosen because multiple matches are "usually surprising." In the data I work with (education) multiple matches are the norm (students attend in multiple terms and take multiple courses in each term; instructors have multiple students, teach more than one course and perhaps multiple sections of a course, etc.). The constant warnings are really annoying, but it would likewise be annoying to have to explicitly set themultiple
argument every time I do a join (including many, many joins across existing functions in our internal packages that would need to be updated).Ideally, the default would be not to warn. If you're not open to that change, is there any way to set an option at the top of a script that disables warnings related to multiple matches, and, if not, would you be willing to add such an option?
The text was updated successfully, but these errors were encountered: