Unicode numbers #1315

Closed
bmschmidt opened this issue Aug 12, 2015 · 6 comments
bmschmidt commented Aug 12, 2015

inner_join seems to be too aggressive when joining on Unicode characters: it treats (at least) Arabic-Indic digits such as "٣" (U+0663) and their Western Arabic (ASCII) counterparts such as "3" as the same key, even though they compare as different at the console.

library(dplyr)

"٣" == "3"
a <- data.frame(character = "٣", set = "arabic_the_language", stringsAsFactors = FALSE)
b <- data.frame(character = "3", set = "arabic_the_numeral_set", stringsAsFactors = FALSE)
b %>% inner_join(a, by = "character")
a %>% inner_join(b, by = "character")

In case this is a result of my local settings, here are the results I get, followed by my session info:

> "٣"=="3"
[1] FALSE
> a= data.frame(character = c("٣"),set=c("arabic_the_language"),stringsAsFactors=F)
> b = data.frame(character = c("3"),set=c("arabic_the_numeral_set"),stringsAsFactors = F)
> b %>% inner_join(a,by=c("character"))
  character                  set.x               set.y
1         3 arabic_the_numeral_set arabic_the_language
> a %>% inner_join(b,by=c("character"))
  character               set.x                  set.y
1         ٣ arabic_the_language arabic_the_numeral_set

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.2

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.0.1       assertthat_0.1 parallel_3.2.0 tools_3.2.0   
[6] DBI_0.3.1      Rcpp_0.11.6   
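
As a cross-check that does not go through dplyr, base R's exact string matching can confirm that the two key columns share no values. This is only a sketch reusing the example data from above; the variable name exact_matches is made up for illustration, and the digit is written as a Unicode escape so the script does not depend on the file's encoding.

```r
# Same example data as above; "\u0663" is the Arabic-Indic digit three.
a <- data.frame(character = "\u0663", set = "arabic_the_language",
                stringsAsFactors = FALSE)
b <- data.frame(character = "3", set = "arabic_the_numeral_set",
                stringsAsFactors = FALSE)

# %in% is built on match(), which compares strings exactly
# rather than through locale collation.
exact_matches <- a[a$character %in% b$character, , drop = FALSE]
nrow(exact_matches)
```

Since the two strings are not equal, an exact inner join on these two frames would be expected to return no rows.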

bmschmidt (Author) commented Aug 12, 2015

Apologies for submitting an incomplete version of this a couple of minutes ago. I reflexively tried to send the code block to RStudio with command-enter, and instead prematurely submitted the issue on GitHub. I've updated the example.

romainfrancois self-assigned this Aug 12, 2015
romainfrancois added this to the 0.5 milestone Aug 12, 2015

romainfrancois (Member) commented Aug 12, 2015

Looks like a weird 🐞 in R.

> rank( c("٣", "3"), ties.method="min", na.last ="keep")
[1] 1 1

This R call is what we use internally, deep down, to order the unique strings from the two columns, so that we have the ordering as R sees it. Both strings get rank 1 here, which would explain why the join treats them as the same key.

hadley (Member) commented Aug 13, 2015

I'll shoot an email to R-devel.

hadley (Member) commented Aug 13, 2015

So it turns out that this rank behaviour is correct for the en_US.UTF-8 locale. @romainfrancois, the problem is that strings might rank equally but still be different :/ Not sure how to work around this.
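
Since the tie comes from the collation locale rather than from the strings being equal, one way to see the difference is to rank the same vector under the C locale, where comparison is bytewise. This is a rough sketch, not what dplyr does; rank_in_c_locale is a made-up helper name, and it temporarily changes LC_COLLATE for the session before restoring it.

```r
# Hypothetical helper: rank strings with LC_COLLATE temporarily set to "C".
rank_in_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the original collation locale when the function exits.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  rank(x, ties.method = "min", na.last = "keep")
}

x <- c("\u0663", "3")  # Arabic-Indic three, ASCII three

# Ranks under the session's own collation locale (may tie, as reported above).
rank(x, ties.method = "min", na.last = "keep")

# Ranks under bytewise C collation, where the two strings cannot tie.
rank_in_c_locale(x)
```

Switching the locale globally is probably not something a package should do inside a join, so this is only meant to illustrate where the tie comes from.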

romainfrancois (Member) commented Aug 13, 2015

At least unique() does not get fooled by that:

> unique( c("٣", "3") )
[1] "٣" "3"

Short of something better, we could throw an error if length(unique(.)) is greater than max(rank(.)). Not sure otherwise how to deal with this.
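
The check suggested above could be sketched roughly as follows. One caveat: with ties.method = "min", comparing max(rank(.)) to length(unique(.)) only catches a tie among the largest values (ranks 1, 1, 3 still have a maximum of 3), so this sketch looks for duplicated ranks instead. The names has_rank_collision and ranker are made up for illustration and are not part of dplyr.

```r
# Hypothetical helper: TRUE if two distinct strings receive the same rank.
# `ranker` is an assumed hook so the ranking function can be swapped out.
has_rank_collision <- function(strings,
                               ranker = function(s) {
                                 rank(s, ties.method = "min", na.last = "keep")
                               }) {
  # unique() compares strings exactly, so every element of u is distinct.
  u <- unique(strings[!is.na(strings)])
  r <- ranker(u)
  # Any repeated rank means collation could not tell two strings apart.
  anyDuplicated(r) > 0L
}

# Whether this reports a collision depends on the collation locale.
has_rank_collision(c("\u0663", "3"))
```

A caller could then stop() with an informative message, or fall back to a different ordering, whenever this returns TRUE.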

The whole point of using rank on the unique SEXP string pointers is that we get the ranks as R sees them (i.e. respecting the encoding, etc.).

I'll try to leverage the fact that even if R does not differentiate them by hashing, it can still figure out that they are different:

> x <- c("٣", "3")
> outer( x, x, "==" )
      [,1]  [,2]
[1,]  TRUE FALSE
[2,] FALSE  TRUE

romainfrancois (Member) commented Aug 14, 2015

Turns out sort gives me better results:

> x <- c("٣", "3")
> match( x, sort(x) )

Perhaps at the expense of some performance.
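
A minimal sketch of the sort-based idea, assuming it would be applied to the unique strings: match() compares strings exactly, so two distinct strings always land on distinct positions even when collation considers them tied. The helper name string_positions is made up, and the "radix" option (which sorts in C-locale order and needs R 3.3.0 or later) is an addition for illustration, not something proposed in this thread.

```r
# Hypothetical helper: position of each string among the sorted unique values.
string_positions <- function(x, method = c("auto", "radix")) {
  method <- match.arg(method)
  u <- unique(x)
  # sort() decides the order; match() then assigns one position per
  # distinct string, because it matches exactly rather than by collation.
  match(x, sort(u, method = method))
}

x <- c("\u0663", "3")  # Arabic-Indic three, ASCII three

# Locale-dependent order, but the two positions always differ.
string_positions(x)

# C-locale (bytewise) order, independent of the session's collation.
string_positions(x, method = "radix")
```

The extra sort() and match() passes are where the performance cost mentioned above would come from.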

lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018