Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bring back legacy arrange() behavior with dplyr.legacy_locale #6327

Merged

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jul 11, 2022

Follow up to #6297 and #6263

So the general idea is:

  • In dplyr 1.1.0, the default behavior of both group_by() and arrange() sort in the C locale
  • You can revert to legacy behavior, where both sort with the system locale, with the option dplyr.legacy_locale = TRUE
  • You can override the arrange() locale with an explicit .locale argument, or with dplyr.locale = "fr", but we prefer the former
  • You cannot override the C locale usage in group_by() (unless you revert to legacy behavior)

A few other notes about this PR:

  • Removed dplyr_locale(), which is no longer useful
  • Default .locale to NULL instead of dplyr_locale()
  • If multiple locale options are set, hierarchy of precedence is .locale, dplyr.locale, dplyr.legacy_locale

One question: Do we even need dplyr.locale anymore? Is being able to revert to the legacy behavior good enough? If we think dplyr.locale would stick around permanently, then it makes sense to keep it too, since I would like dplyr.legacy_locale to phase out. dplyr.locale might be useful as a permanent escape hatch for the case where you are calling something that calls arrange() for you and doesn't expose .locale. (Update: We removed dplyr.locale)

@DavisVaughan DavisVaughan requested a review from hadley July 12, 2022 00:35
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we can drop dplyr.locale — I don't think we need some escape hatch to control arrange done elsewhere because in most cases, you can just re-arrange, and it's not likely to be that expensive (because you're probably rearranging for display, so there's not going to be that much data).

R/arrange.R Outdated
#' - Defaults to [dplyr_locale()], which uses the `"C"` locale unless this is
#' explicitly overriden. See the help page for [dplyr_locale()] for the
#' exact details.
#' - If `NULL`, the default, then the `"C"` locale will be used unless this is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow "legacy" here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the only way to get legacy behavior in group_by() is to set the dplyr.legacy_locale global option, I think I'd like to keep that as the only way to get legacy behavior in arrange() too. I think it should be "all or nothing" legacy behavior that gets applied to both.

Comment on lines -114 to -117
if (length(dots) == 0L) {
out <- seq_len(nrow(data))
return(out)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this because we no longer need an early exit when there aren't any ... supplied. This was problematic before because of the paste0() recycling issue and some problems with order() with no inputs, but I've nicely worked around those now.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Aug 10, 2022

Okay I have removed support for dplyr.locale and refreshed the docs and tests to reflect that.

Also updated my original comment in this PR to reflect the new state of how arrange() and group_by() work in 1.1.0

I'm going to request one last review to ensure that this all looks good

R/arrange.R Outdated Show resolved Hide resolved
NEWS.md Show resolved Hide resolved
NEWS.md Outdated
change improves reproducibility across R sessions and operating systems

* `arrange()` and `group_by()` now both default to using the C locale when
ordering or grouping character vectors. This brings _substantial_ performance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to include some benchmarks in the blog post.

I wonder if we should say that this makes dplyr consistent with data.table?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another pass through the NEWS, simplifying the individual group-by and arrange bullets, since the combined bullet does a better job of talking about the C locale parts

I mentioned dplyr/data.table consistency in the 3rd bullet as part of a list of reasons why we made this change

R/arrange.R Outdated Show resolved Hide resolved
@@ -176,8 +166,22 @@ arrange_rows <- function(data,
})

directions <- directions[quo_names %in% names(data)]

if (is.null(locale) && dplyr_legacy_locale()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could simplify the logic by doing this earlier — if the legacy situation, we currently jump through a bunch of hoops to remove desc() calls before eventually adding them back in again. But maybe it's not worth it because of the transmute gymnastics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we add them back in?

I think it has to happen here, both paths do the transmute() bits and both paths need to strip out desc().

Unless im missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you mean at the end of dplyr_proxy_order_legacy(). CRAN dplyr does it this way too, so I'm inclined to keep it. I think figuring out the transmute stuff in combination with moving this code is going to be more challenging than its worth

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds good.

DavisVaughan and others added 9 commits August 16, 2022 17:16
- Matches legacy `group_by()` behavior which uses the same option
- No longer export `dplyr_locale()`, which is no longer useful
- Default `.locale` to `NULL` instead of `dplyr_locale()`
- If multiple locale options are set, hierarchy of precedence is `.locale`, `dplyr.locale`, `dplyr.legacy_locale`
Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
@DavisVaughan DavisVaughan merged commit 71d223c into tidyverse:main Aug 16, 2022
@DavisVaughan DavisVaughan deleted the feature/arrange-legacy-locale branch August 16, 2022 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants