Handle Turkish dotted and dotless i properly #3011

yutannihilation · 2018-11-19T15:20:42Z

Turkish has two versions of i: dotted and dotless. Accordingly, in a Turkish locale the capital of i is İ, not I, which makes ggplot2 fail to find PositionIdentity and StatIdentity. The same thing can happen with dotless ı and I (cf. http://www.i18nguy.com/unicode/turkish-i18n.html).

This seems to be a well-known problem, and ICU has a flag for it, but unfortunately there's no way to use that flag via R's toupper().

It is also not language-sensitive, although there is a flag for whether to apply special mappings for use with Turkic (Turkish/Azerbaijani) text data.
(http://userguide.icu-project.org/transforms/casemappings)

After googling around, I found several options to fix this:

1. Use chartr() instead of toupper() and tolower() (This PR)
2. Use stringi::stri_trans_toupper() and stringi::stri_trans_tolower(), where we can specify locale.
3. Modify the locale temporarily

For 2., stringi is cool, but adding it as a dependency is a bit heavy. For 3., it seems dangerous to mess with the user's locale. So, of these, I think option 1 is the most reasonable, although not very cool. chartr() is around 4x slower than toupper(), but I hope this is acceptable considering that people in a Turkish locale probably can hardly use ggplot2 at all... (I'm not familiar with this problem, so I might be wrong.)

bench::mark(
  chartr("abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "a"),
  toupper("a")
)
#> # A tibble: 2 x 10
#>   expression     min   mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "chartr(\~   3.3us 7.03us  3.8us 732us   142357.        0B     0 10000
#> 2 "toupper(~   700ns 2.24us    1us 291us   446879.        0B     0 10000
#> # ... with 1 more variable: total_time <bch:tm>
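
For reference, option 1 can be sketched as below. This is a minimal illustration, not the PR's final code: the helper names follow the naming suggested later in review, and `first_upper()` is a hypothetical helper added here only to show the kind of lookup name (`StatIdentity`) that breaks when `toupper("i")` returns `İ`.

```r
# ASCII-only case conversion: chartr() maps characters one-to-one from a
# fixed table, so the result does not depend on the session locale.
lower_ascii <- "abcdefghijklmnopqrstuvwxyz"
upper_ascii <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

to_upper_ascii <- function(x) chartr(lower_ascii, upper_ascii, x)
to_lower_ascii <- function(x) chartr(upper_ascii, lower_ascii, x)

# Hypothetical helper (not from the PR): capitalise the first letter only.
first_upper <- function(x) {
  paste0(to_upper_ascii(substring(x, 1, 1)), substring(x, 2))
}

stat_name <- paste0("Stat", first_upper("identity"))
stat_name
#> [1] "StatIdentity"
```

Because the translation table is fixed, non-ASCII characters pass through unchanged, which is fine here since the names being converted are ASCII identifiers.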

hadley

Approach seems good. It would be nice to have a unit test, but I can't think of an obvious way to simulate it.

hadley · 2018-11-19T16:59:10Z

R/utilities.r

+# Use chartr() for safety since toupper() fails to convert i to I in Turkish locale
+lower_ascii <- "abcdefghijklmnopqrstuvwxyz"
+upper_ascii <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+tolower_ascii <- function(x) chartr(upper_ascii, lower_ascii, x)


Can we make this to_lower_ascii() and to_upper_ascii()?

And I think we should wrap tolower() and toupper() to throw an error, to make sure that this error doesn't arise again in the future.

yutannihilation · 2018-11-19T22:28:27Z

Thanks, I added some unit tests. Do you think it's overkill to install language-pack-tr-base just for several test cases? If so, I'm OK with removing this. Anyway, I believe the fix is confirmed as valid since this commit passed checks:

5f1bb92
https://travis-ci.org/tidyverse/ggplot2/jobs/457186678

yutannihilation · 2018-11-19T23:39:40Z

Ah, sorry, it seems I'm detecting the available locales incorrectly... Let me fix it.

any(x == "tr_TR") isn't true.
https://travis-ci.org/tidyverse/ggplot2/builds/457214479#L5545
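
One plausible reason an exact comparison fails is that installed locales are usually listed with an encoding suffix (for example `tr_TR.utf8`), so the name has to be matched by prefix. The sketch below assumes a Unix-like system where `locale -a` is available; `find_locale()` and the `installed` vector are hypothetical and are not code or output from this PR.

```r
# Hypothetical helper: return the first installed locale whose name starts
# with `prefix`, or NA if none is found.
find_locale <- function(prefix, available = NULL) {
  if (is.null(available)) {
    # Assumes a Unix-like system; `locale -a` lists the installed locales.
    available <- tryCatch(
      system2("locale", "-a", stdout = TRUE, stderr = FALSE),
      error = function(e) character(0),
      warning = function(w) character(0)
    )
  }
  hits <- available[startsWith(available, prefix)]
  if (length(hits) > 0) hits[[1]] else NA_character_
}

# Illustrative list of locale names (made up for this example).
installed <- c("C", "C.UTF-8", "POSIX", "tr_TR.utf8")

exact_match <- any(installed == "tr_TR")      # FALSE: the suffix defeats ==
prefix_match <- find_locale("tr_TR", installed)  # "tr_TR.utf8"
```

A test could then be skipped when `find_locale("tr_TR")` returns `NA`, rather than failing on machines without the Turkish language pack.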

hadley · 2018-11-19T23:42:11Z

Sorry, my comment was supposed to reinforce not having a unit test. I don't think it gets us much here as it's likely to be fragile, and we've reduced the chance of a regression by overriding toupper().

hadley · 2018-11-19T23:42:27Z

R/utilities.r

+}
+
+toupper <- function(x) {
+  stop('Please use `to_upper_ascii()`, which works fine in Turkish locale.', call. = FALSE)


"in Turkish locale" -> "in all locales"
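
The masking idea can be sketched as follows. Defining `toupper()` and `tolower()` at the top level of a package shadows the base versions for code inside that package's namespace, so any accidental use fails immediately. The message wording below is illustrative and may differ from the text finally merged.

```r
# Shadow the locale-sensitive base functions so that any accidental call
# from package code fails loudly instead of misbehaving in some locales.
toupper <- function(x) {
  stop("Please use `to_upper_ascii()`, which works fine in all locales.", call. = FALSE)
}
tolower <- function(x) {
  stop("Please use `to_lower_ascii()`, which works fine in all locales.", call. = FALSE)
}

# The base versions stay reachable through an explicit namespace prefix.
base_result <- base::toupper("a")

# Calling the masked version raises the error; capture its message.
masked_result <- tryCatch(toupper("a"), error = function(e) conditionMessage(e))
```

The trade-off is that the guard fires at run time rather than at check time, so it only catches code paths that are actually exercised.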

yutannihilation · 2018-11-19T23:46:23Z

Oh, I got it wrong, sorry. Agreed.

yutannihilation · 2018-11-20T12:51:44Z

Thanks for the confirmation, @cystein!

* Use chartr() instad of toupper() and tolower()

lock · 2019-05-19T13:19:07Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Use chartr() instad of toupper() and tolower()

94bec95

hadley reviewed Nov 19, 2018

yutannihilation added 6 commits November 20, 2018 06:28

Rename to to_upper_ascii() and to_lower_ascii()

7ef3778

Mask tolower() and toupper()

6589ed1

Test Turkish locale

92faa3f

Fix test

5f1bb92

Use unit tests only

46e7475

Add a NEWS bullet

fd5db11

hadley reviewed Nov 19, 2018

yutannihilation added 2 commits November 20, 2018 08:47

Remove unit tests about Turkish locale

a40f087

Improve the error message of overritten tolower() and toupper()

c5772d0

yutannihilation mentioned this pull request Nov 20, 2018

Error: Can't find stat called “identity” #3000

Closed

cystein approved these changes Nov 20, 2018

hadley approved these changes Nov 20, 2018

yutannihilation merged commit 669a606 into tidyverse:master Nov 20, 2018

yutannihilation deleted the fix-turkish-dotted-i branch November 20, 2018 13:05

kevinhankens pushed a commit to kevinhankens/ggplot2 that referenced this pull request Nov 26, 2018

Handle Turkish dotted and dotless i properly (tidyverse#3011)

c63cb80

* Use chartr() instad of toupper() and tolower()

kevinhankens pushed a commit to kevinhankens/ggplot2 that referenced this pull request Nov 26, 2018

Handle Turkish dotted and dotless i properly (tidyverse#3011)

592aca1

* Use chartr() instad of toupper() and tolower()

kevinhankens pushed a commit to kevinhankens/ggplot2 that referenced this pull request Nov 26, 2018

Handle Turkish dotted and dotless i properly (tidyverse#3011)

d370b43

* Use chartr() instad of toupper() and tolower()

mjskay mentioned this pull request Mar 25, 2019

geom_violinh fails with warning to "Please use to_lower_ascii()" with newly-released ggplot2 lionel-/ggstance#30

Closed

lock bot locked and limited conversation to collaborators May 19, 2019
