Wrong ordering with french locale #325

Closed
knokknok opened this issue Mar 16, 2014 · 9 comments
@knokknok

Sys.setlocale(locale = "fr_FR.UTF-8")
library(plyr)
library(dplyr)
a <- structure(
  list(Name = c("IODOHIPPURATE DE SODIUM [131 I]", "IODOHIPPURATE [123-I] DE SODIUM",
                "IODURE DE POTASSIUM", "IODURE DE POTASSIUM", "IODURE [123 I] DE SODIUM",
                "IODURE [131 I] DE SODIUM", "IODURE [131 I] DE SODIUM POUR THERAPIE")),
  .Names = "Name",
  row.names = c(10840L, 10841L, 10852L, 10853L, 10854L, 10855L, 10856L),
  class = "data.frame")
a[order(a$Name), , drop = FALSE]  # base R ordering, for comparison
plyr::arrange(a, Name)            # plyr ordering, for comparison
dplyr::arrange(a, Name)           # dplyr ordering, reported here as wrong
@hadley hadley added this to the v0.2 milestone Mar 17, 2014
@romainfrancois romainfrancois self-assigned this Mar 26, 2014
@romainfrancois
Member

It appears that, to compare two R strings while taking locale information into account, we need to use Rf_Scollate:
https://github.com/wch/r-source/blob/8c53d05cdbc43ec9819bea0d7cd90f2c5116dab7/src/main/util.c#L1863

> cppFunction( "int strcoll_( CharacterVector lhs, CharacterVector rhs){ return Rf_Scollate( lhs[0], rhs[0]); }", includes = 'extern "C" { extern int Rf_Scollate(SEXP a, SEXP b) ; }'  )
>
> cppFunction( "int strcmp_( std::string lhs, std::string rhs){ return strcmp( lhs.c_str(), rhs.c_str() ) ; }" )
>
> strcoll_( "IODOHIPPURATE DE SODIUM [131 I]", "IODOHIPPURATE [123-I] DE SODIUM" )
[1] 1
> strcmp_( "IODOHIPPURATE DE SODIUM [131 I]", "IODOHIPPURATE [123-I] DE SODIUM" )
[1] -23

Rf_Scollate is one of those non-API functions, so we can't really use it, especially on Windows, where it is explicitly hidden by the Rdll.hide file.

@hadley
Member

hadley commented Mar 26, 2014

So things will always be sorted with the C locale? I think that's ok - we just need to make a note of it in the docs.
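
For readers unsure what "sorted with the C locale" means in practice, here is a minimal base R sketch. It assumes nothing beyond base R; the vector `x` is the Name column from the report, and switching `LC_COLLATE` to "C" is just one way to reproduce byte-wise ordering for comparison, not what dplyr does internally.

```r
# Names taken from the original report.
x <- c("IODOHIPPURATE DE SODIUM [131 I]", "IODOHIPPURATE [123-I] DE SODIUM",
       "IODURE DE POTASSIUM", "IODURE DE POTASSIUM", "IODURE [123 I] DE SODIUM",
       "IODURE [131 I] DE SODIUM", "IODURE [131 I] DE SODIUM POUR THERAPIE")

# Sort under the session's current collation locale (result depends on the session).
locale_sorted <- sort(x)

# Temporarily switch collation to "C" to get plain byte-wise ordering,
# then restore the previous setting.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_sorted <- sort(x)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(c_sorted)
# TRUE if the session locale happens to collate these strings like "C" does.
print(identical(locale_sorted, c_sorted))
```

Whether the two results differ depends on the locale the session runs in; under a French UTF-8 locale the bracketed names are expected to move, which is the discrepancy reported here.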

@romainfrancois
Member

One advantage of Scollate, I guess, is that it would also handle encodings.
On the same note, it would be nice to have access to int Seql too.

@romainfrancois
Member

It might be slightly better to call strcoll instead of strcmp. That would at least take the locale into account, but it still would not compare the two strings exactly the way R does. Scollate is the only way to get that, and it is unlikely that we'll get R core to make it part of the R API. http://r.789695.n4.nabble.com/internal-string-comparison-Scollate-td4687584.html

@romainfrancois
Member

So I guess for now, we're back at this being a documentation issue. @hadley not sure where this should be documented. Presumably this affects arrange and order_, and perhaps group_by.

@hadley hadley assigned hadley and unassigned romainfrancois Apr 1, 2014
@hadley
Member

hadley commented Apr 1, 2014

I'll take care of it.

@hadley hadley closed this as completed in c80a087 Apr 9, 2014
@romainfrancois romainfrancois reopened this Nov 8, 2014
@romainfrancois romainfrancois modified the milestones: 0.3.1, v0.2 Nov 8, 2014
@romainfrancois
Member

Reopening this, as I think I can make it work for real now. The idea is to call R's order on the full column, use the resulting indices in the internal sort algorithm, and then look values up in the original character vector column when materializing the result.

@romainfrancois
Member

Well, actually I need something along the lines of match(x, sort(x)), i.e. the position of each string in the sorted column.
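
A rough sketch of that idea in plain R, for illustration only. The variable names are made up, and `order()` on the integer ranks merely stands in for the internal sort, so this is not the actual dplyr implementation.

```r
x <- c("banana", "apple", "cherry", "apple")

# Position of each string in the locale-aware sorted column.
# Equal strings share the position of their first occurrence.
ranks <- match(x, sort(x))

# From here on only integers are compared, so the sort itself no longer
# needs locale-aware string comparison.
idx <- order(ranks)

# Materialize the result by indexing back into the original character vector.
result <- x[idx]
print(result)
```

All the locale-sensitive work happens once, in `sort(x)`; everything after that only compares integers.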

@lock

lock bot commented Sep 16, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 16, 2018