inner_join, left_join, etc. have almost unavoidable failure on properly encoded character columns #769

Closed
mdsosa opened this issue Nov 12, 2014 · 6 comments

@mdsosa mdsosa commented Nov 12, 2014

Requiring all members of a character column to share the same encoding before performing a join is highly incompatible with character encoding behavior in R 3.1.2 (and prior). In almost every practical case where properly encoded data does contain non-ASCII characters, the current dplyr join functions will result in

Error: cannot join on columns 'name' x 'name' : found multiple encodings in character string
on execution.

The behavior in R that is troublesome is hinted at in help(Encoding):

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.

Below is a reproducible test case, including additional lines showing that neither Encoding()<- nor iconv() remedies the situation, most likely because of the behavior described above:

# R 3.1.2 on Windows 7
# dplyr 0.3.0.2
library("dplyr")

#create two data frames x and y with some latin1 encodings
x<-data.frame(name=c("\xC9lise","Pierre","Fran\xE7ois"),score=c(5,7,6),stringsAsFactors = FALSE)
Encoding(x$name)  ##note the encodings

y<-data.frame(name=c("\xC9lise","Pierre","Fran\xE7ois"),attendance=c(8,10,9),stringsAsFactors = FALSE)
Encoding(y$name)  ##note the encodings

#next statement throws "Error: cannot join on columns 'name' x 'name' : found multiple encodings in character string"
##can also try inner, semi, anti with same Error thrown
results<-left_join(x,y,by="name")

#I can't fix the encodings themselves with either iconv or Encoding<- because
#R won't force it for the strings with only ASCII characters in them
Encoding(x$name)<-"latin1"
Encoding(x$name) ##still has "unknown" for Pierre

x$name<-iconv(x$name,from = "latin1",to="latin1",mark = TRUE)
Encoding(x$name)  ##still has "unknown" for Pierre

#base merge doesn't have this issue
results<-merge(x,y,by = "name")
results  ##everything looks good

@romainfrancois romainfrancois commented Nov 18, 2014

I get:

> Encoding(x$name)  ##note the encodings
[1] "unknown" "unknown" "unknown"
> Encoding(y$name)  ##note the encodings
[1] "unknown" "unknown" "unknown"
> results<-left_join(x,y,by="name")
> results
         name score attendance
1    \xc9lise     5          8
2      Pierre     7         10
3 Fran\xe7ois     6          9

@hadley hadley commented Nov 18, 2014

I get:

> Encoding(x$name)  ##note the encodings
[1] "latin1"  "unknown" "latin1" 
> Encoding(y$name)  ##note the encodings
[1] "unknown" "unknown" "unknown"

@hadley hadley commented Dec 8, 2014

@romainfrancois could you please take another look? I'm not sure why you and I are getting different results. Does this give the same results for you?

Encoding(c("a", "å"))
#> [1] "unknown" "UTF-8"

For an ASCII string, "unknown" simply means ASCII, so it's OK for it to be mixed with UTF-8 or latin1.
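
A minimal sketch of how one might check that a mixed-encoding column is harmless in this sense, i.e. that every entry marked "unknown" is pure ASCII. The helper `is_ascii()` is hypothetical (it is not part of dplyr or base R), and the strings are explicitly marked as latin1 so the result does not depend on the session's locale:

```r
# Hypothetical helper: TRUE where every byte of the string is below 0x80.
is_ascii <- function(x) {
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
}

x <- c("\xC9lise", "Pierre", "Fran\xE7ois")
Encoding(x) <- "latin1"  # the ASCII-only "Pierre" is never marked

enc <- Encoding(x)
enc
#> [1] "latin1"  "unknown" "latin1"

# Mixing is harmless when every "unknown" entry is pure ASCII.
all(is_ascii(x[enc == "unknown"]))
#> [1] TRUE
```

A check along these lines would distinguish the harmless case (ASCII strings left unmarked) from a vector that really does hold unmarked non-ASCII bytes.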

@romainfrancois romainfrancois commented Dec 10, 2014

I get:

> Encoding(c("a", "å"))
[1] "unknown" "UTF-8"

But still:

> #create two data frames x and y with some latin1 encodings
> x<-data.frame(name=c("\xC9lise","Pierre","Fran\xE7ois"),score=c(5,7,6),stringsAsFactors = FALSE)
> Encoding(x$name)  ##note the encodings
[1] "unknown" "unknown" "unknown"
>
> y<-data.frame(name=c("\xC9lise","Pierre","Fran\xE7ois"),attendance=c(8,10,9),stringsAsFactors = FALSE)
> Encoding(y$name)  ##note the encodings
[1] "unknown" "unknown" "unknown"

I'll try to figure something out to make the _join functions less problematic when multiple encodings are involved.

@romainfrancois romainfrancois commented Dec 14, 2014

Done. I bind the strings from the two vectors into one big vector, run rank(x, ties.method = "min") on it, and then use the resulting integers to implement equality and comparison.

This is kind of memory expensive, but R does not give us an API to compare strings individually at the C level.
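
For illustration, the idea can be sketched in plain R. This is only a sketch under stated assumptions, not dplyr's actual implementation: the helper `rank_keys()` is hypothetical, and the example keeps both vectors in the same encoding, because how rank() orders strings with differing encodings depends on R's collation and the session's locale.

```r
# Hypothetical helper: give both vectors integer keys from one shared ranking.
rank_keys <- function(x, y) {
  combined <- c(x, y)
  # Equal strings receive the same integer because ties share the minimum rank.
  r <- rank(combined, ties.method = "min")
  list(
    x = r[seq_along(x)],
    y = r[length(x) + seq_along(y)]
  )
}

x_names <- c("\xC9lise", "Pierre", "Fran\xE7ois")
Encoding(x_names) <- "latin1"
y_names <- rev(x_names)  # same names, reversed order

keys <- rank_keys(x_names, y_names)

# Equality on the integer keys stands in for string equality:
# position in y_names of each element of x_names.
match(keys$x, keys$y)
#> [1] 3 2 1
```

The memory cost mentioned above shows up here as the `combined` vector and its rank vector, both as long as the two inputs together.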

@mdsosa mdsosa commented Dec 15, 2014

Sounds great. Regarding the earlier observations where @romainfrancois kept getting different results for Encoding(x$name), etc., I should also note that I am using RStudio 0.98.1074, in case that makes a difference. Also, getOption("encoding") is set to the default, "native.enc".

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018