Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Turkish dotted and dotless i properly #3011

Merged

Conversation

yutannihilation
Copy link
Member

Fix #3000

Turkish has two versions of i; dotted and dotless. Accordingly, the capital of i is İ, not I, which make ggplot2 to fail to find PositionIdentity and StatIdentity. The same thing can happen on dotless ı and I (c.f. http://www.i18nguy.com/unicode/turkish-i18n.html)

This seems a well-known problem, and ICU has a flag for this, but unfortunately, there's no way to use this via R's toupper().

It is also not language-sensitive, although there is a flag for whether to apply special mappings for use with Turkic (Turkish/Azerbaijani) text data.
(http://userguide.icu-project.org/transforms/casemappings)

After googling around, I found several options to fix this:

  1. Use chartr() instead of toupper() and tolower() (This PR)
  2. Use stringi::stri_trans_toupper() and stringi::stri_trans_tolower(), where we can specify locale.
  3. Modify the locale temporarily

For 2., stringi is cool, but adding it to the dependency is a bit heavy. For 3., it seems dangerous to mess the user's locale. So, from these, I think option 1 is rather reasonable, although not very cool. chartr() is around 4x slower than toupper(), but I hope this is acceptable considering that people in Turkish locale probably almost cannot use ggplot2... (I'm not familiar with this problem, so I might be wrong.)

bench::mark(
  chartr("abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "a"),
  toupper("a")
)
#> # A tibble: 2 x 10
#>   expression     min   mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "chartr(\~   3.3us 7.03us  3.8us 732us   142357.        0B     0 10000
#> 2 "toupper(~   700ns 2.24us    1us 291us   446879.        0B     0 10000
#> # ... with 1 more variable: total_time <bch:tm>

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach seems good. It would be nice to have a unit test, but I can't think of an obvious way to simulate it.

R/utilities.r Outdated
# Use chartr() for safety since toupper() fails to convert i to I in Turkish locale
lower_ascii <- "abcdefghijklmnopqrstuvwxyz"
upper_ascii <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
tolower_ascii <- function(x) chartr(upper_ascii, lower_ascii, x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this to_lower_ascii() and to_upper_ascii()?

And I think we should wrap to_lower() and to_upper() to throw an error, to make sure that this error doesn't arise again in the future.

@yutannihilation
Copy link
Member Author

Thanks, I added some unit tests. Do you think it's overkill to install language-pack-tr-base just for several test cases? If so, I'm OK to remove this. Anyway, I believe the fix is confirmed as valid since this commit passsed checks:

5f1bb92
https://travis-ci.org/tidyverse/ggplot2/jobs/457186678

@yutannihilation
Copy link
Member Author

Ah, sorry, it seems I'm doing wrong about detecting the available locales... Let me fix.

any(x == "tr_TR") isn't true.
https://travis-ci.org/tidyverse/ggplot2/builds/457214479#L5545

@hadley
Copy link
Member

hadley commented Nov 19, 2018

Sorry my comment was supposed to reinforce not having a unit test. I don't think it gets us much here as it's likely to be fragile, and we've reduced the chance by overwriting toupper()

R/utilities.r Outdated
}

toupper <- function(x) {
stop('Please use `to_upper_ascii()`, which works fine in Turkish locale.', call. = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in Turkish locale" -> "in all locales"

@yutannihilation
Copy link
Member Author

Oh, I got it wrong, sorry. Agreed.

@yutannihilation
Copy link
Member Author

Thanks for confirmation, @cystein!

@yutannihilation yutannihilation merged commit 669a606 into tidyverse:master Nov 20, 2018
@yutannihilation yutannihilation deleted the fix-turkish-dotted-i branch November 20, 2018 13:05
kevinhankens pushed a commit to kevinhankens/ggplot2 that referenced this pull request Nov 26, 2018
* Use chartr() instad of toupper() and tolower()
kevinhankens pushed a commit to kevinhankens/ggplot2 that referenced this pull request Nov 26, 2018
* Use chartr() instad of toupper() and tolower()
kevinhankens pushed a commit to kevinhankens/ggplot2 that referenced this pull request Nov 26, 2018
* Use chartr() instad of toupper() and tolower()
@lock
Copy link

lock bot commented May 19, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators May 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error: Can't find stat called “identity”
3 participants