inner_join, left_join, etc. have almost unavoidable failure on properly encoded character columns #769

@mdsosa

Description

Requiring every element of a character column to share the same encoding before a join is incompatible with how R 3.1.2 (and earlier) marks character encodings. In almost every practical case where properly encoded data contains non-ASCII characters, the current dplyr join functions fail on execution with:

Error: cannot join on columns 'name' x 'name' : found multiple encodings in character string

The R behavior that causes the trouble is hinted at in help(Encoding):

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.
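
As a minimal, platform-independent sketch of that documented behavior (the variable names here are only illustrative, and \u escapes are used so the example does not depend on a latin1 locale), an ASCII-only string keeps the "unknown" mark even after an explicit declaration or conversion:

```r
# An ASCII-only string and a string containing a non-ASCII character.
# A \u escape yields a string marked as UTF-8 on any platform.
ascii_name <- "Pierre"
accented_name <- "\u00c9lise"

Encoding(accented_name)  # "UTF-8"
Encoding(ascii_name)     # "unknown"

# Neither an explicit declaration nor a conversion changes the ASCII mark.
Encoding(ascii_name) <- "UTF-8"
Encoding(ascii_name)            # "unknown"
Encoding(enc2utf8(ascii_name))  # "unknown"
```

Any column that mixes accented and unaccented names will therefore report at least two distinct encoding marks.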

Below is a reproducible test case. The additional lines show that neither Encoding()<- nor iconv() remedies the situation, most likely because of the behavior described above:

# R 3.1.2 on Windows 7
# dplyr 0.3.0.2
library("dplyr")

# Create two data frames x and y whose name columns contain latin1-encoded strings
x <- data.frame(name = c("\xC9lise", "Pierre", "Fran\xE7ois"), score = c(5, 7, 6), stringsAsFactors = FALSE)
Encoding(x$name)  # note the encodings

y <- data.frame(name = c("\xC9lise", "Pierre", "Fran\xE7ois"), attendance = c(8, 10, 9), stringsAsFactors = FALSE)
Encoding(y$name)  # note the encodings

# The next statement throws
# "Error: cannot join on columns 'name' x 'name' : found multiple encodings in character string".
# inner_join, semi_join and anti_join throw the same error.
results <- left_join(x, y, by = "name")

# The encodings cannot be fixed with either iconv() or Encoding()<- because
# R will not mark strings that contain only ASCII characters
Encoding(x$name) <- "latin1"
Encoding(x$name)  # still has "unknown" for Pierre

x$name <- iconv(x$name, from = "latin1", to = "latin1", mark = TRUE)
Encoding(x$name)  # still has "unknown" for Pierre

# Base merge() does not have this issue
results <- merge(x, y, by = "name")
results  # everything looks good
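
To illustrate what a more tolerant check could look like, here is a rough sketch that ignores ASCII-only elements when collecting encoding marks. Both is_ascii() and non_ascii_encodings() are hypothetical helpers written for this example; they are not part of dplyr or base R, and this is not a description of how dplyr performs its check.

```r
# Hypothetical helper: TRUE for elements made up only of 7-bit ASCII bytes.
is_ascii <- function(x) {
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical helper: distinct encoding marks among the non-ASCII elements.
non_ascii_encodings <- function(x) {
  unique(Encoding(x[!is_ascii(x)]))
}

# \u escapes are used here so the example does not depend on the locale.
names_utf8 <- c("\u00c9lise", "Pierre", "Fran\u00e7ois")

Encoding(names_utf8)             # "UTF-8" "unknown" "UTF-8"
non_ascii_encodings(names_utf8)  # "UTF-8"
```

Under this assumption, a column would count as having multiple encodings only when non_ascii_encodings() returns more than one value, so the unmarked "Pierre" would no longer trigger a failure.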

Metadata

Labels

bug: an unexpected problem or unintended behavior
