Description
Requiring all members of a character column to share the same encoding before performing a join is highly incompatible with character encoding behavior in R 3.1.2 (and prior). In almost every practical case where properly encoded data does contain non-ASCII characters, the current dplyr join functions will result in
Error: cannot join on columns 'name' x 'name' : found multiple encodings in character string
on execution.
The R behavior that causes the trouble is hinted at in help(Encoding):
ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.
As a result, a column that mixes ASCII-only and non-ASCII strings always reports at least two encodings. Below is a reproducible test case, including additional lines showing that neither Encoding()<- nor iconv() remedies the situation, most likely because of the behavior described above:
# R 3.1.2 on Windows 7
# dplyr 0.3.0.2
library("dplyr")
#create two data frames x and y with some latin1 encodings
x<-data.frame(name=c("\xC9lise","Pierre","Fran\xE7ois"),score=c(5,7,6),stringsAsFactors = FALSE)
Encoding(x$name) ##note the encodings
y<-data.frame(name=c("\xC9lise","Pierre","Fran\xE7ois"),attendance=c(8,10,9),stringsAsFactors = FALSE)
Encoding(y$name) ##note the encodings
#next statement throws "Error: cannot join on columns 'name' x 'name' : found multiple encodings in character string"
##can also try inner, semi, anti with same Error thrown
results<-left_join(x,y,by="name")
#I can't fix the encodings themselves with either iconv or Encoding<- because
#R won't force it for the strings with only ASCII characters in them
Encoding(x$name)<-"latin1"
Encoding(x$name) ##still has "unknown" for Pierre
x$name<-iconv(x$name,from = "latin1",to="latin1",mark = TRUE)
Encoding(x$name) ##still has "unknown" for Pierre
#base merge doesn't have this issue
results<-merge(x,y,by = "name")
results ##everything looks good
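One way a more tolerant check could work is to ignore ASCII-only strings and compare the declared encodings of the non-ASCII elements only. The following is a minimal sketch of that idea in base R, not dplyr's actual implementation; the helper names `is_non_ascii` and `encodings_compatible` are hypothetical and introduced here purely for illustration.

```r
# Hypothetical helpers, not part of dplyr or base R.

# TRUE for each element that contains at least one byte above 0x7F.
is_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# TRUE when all non-ASCII elements share a single declared encoding.
# ASCII-only elements are skipped, since R always reports them as "unknown".
encodings_compatible <- function(x) {
  marks <- unique(Encoding(x[is_non_ascii(x)]))
  length(marks) <= 1L
}

x <- c("\xC9lise", "Pierre", "Fran\xE7ois")
Encoding(x) <- "latin1"
Encoding(x)              # "latin1" "unknown" "latin1"
encodings_compatible(x)  # TRUE: only "Pierre" is unmarked, and it is pure ASCII
```

Under this rule the vectors in the test case above would be accepted, while a column that genuinely mixes latin1-marked and UTF-8-marked non-ASCII strings would still be rejected.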