arrange()-ing words with accented vowels #1280

Closed
ekbrown opened this issue Jul 21, 2015 · 6 comments

@ekbrown ekbrown commented Jul 21, 2015

arrange() interacts differently with data_frame than with data.frame when ordering words with accented vowels:

> df1 <- data.frame(word = c("casa", "árbol"))
> df1 %>% arrange(word)
   word
1 árbol
2  casa
> 
> df2 <- data_frame(word = c("casa", "árbol"))
> df2 %>% arrange(word)
Source: local data frame [2 x 1]

   word
1  casa
2 árbol
> 
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.2.9002

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10.9000 magrittr_1.5         R6_2.1.0             assertthat_0.1      
[5] parallel_3.2.1       DBI_0.3.1            tools_3.2.1          Rcpp_0.11.6         
Earls-MBP:~ earlbrown$ clang++ -v

Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)

Target: x86_64-apple-darwin14.4.0

Thread model: posix

Earls-MBP:~ earlbrown$ g++ -v

Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1

Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)

Target: x86_64-apple-darwin14.4.0

Thread model: posix

@hadley hadley commented Jul 21, 2015

Can you please make the example reproducible? i.e. something that I can copy and paste directly into R. (And no need to include sessionInfo().)

@ekbrown ekbrown commented Jul 21, 2015

library("dplyr")
words <- c("casa", "árbol", "zona", "órgano")
df1 <- data.frame(words)
df1 %>% arrange(words)
df2 <- data_frame(words)
df2 %>% arrange(words)

@mczapanskiy-usgs mczapanskiy-usgs commented Jul 30, 2015

I believe this is by design. data_frame preserves the class of its inputs, so df2$words is character but df1$words is factor. The following code shows data.frame and data_frame behaving identically once stringsAsFactors = FALSE is set:

library("dplyr")
words <- c("casa", "árbol", "zona", "órgano")
df1 <- data.frame(words, stringsAsFactors = FALSE)
df1 %>% arrange(words)
#    words
# 1   casa
# 2   zona
# 3  árbol
# 4 órgano
df2 <- data_frame(words)
df2 %>% arrange(words)
#    words
# 1   casa
# 2   zona
# 3  árbol
# 4 órgano
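
A minimal sketch to confirm the class difference described above. It passes stringsAsFactors explicitly because the default differs by R version (TRUE in the R 3.2.1 session from the original report, FALSE from R 4.0.0 onward). The names df_factor and df_char are illustrative only.

```r
words <- c("casa", "árbol", "zona", "órgano")

# Explicit stringsAsFactors so the result does not depend on the R version
df_factor <- data.frame(words, stringsAsFactors = TRUE)
df_char   <- data.frame(words, stringsAsFactors = FALSE)

class(df_factor$words)  # "factor"
class(df_char$words)    # "character"
```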

@kevinushey kevinushey commented Jul 30, 2015

Yes, this is just yet another victim of stringsAsFactors.

@ekbrown ekbrown commented Jul 30, 2015

Weird. So, apparently I'll need to coerce words to factor to get the correct output. Why does base::sort correctly sort characters (and factors), but dplyr::arrange doesn't correctly sort characters, only factors?

library("dplyr")
words <- c("casa", "árbol", "zona", "órgano")
df2 <- data_frame(words)
df2 %>% arrange(as.factor(words))
# Source: local data frame [4 x 1]
#
#    words
# 1  árbol
# 2   casa
# 3 órgano
# 4   zona

sort(df2$words)
# [1] "árbol"  "casa"   "órgano" "zona"
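
One likely reason the factor version comes out in the expected order: factor() builds its levels with sort(), which follows the R session's collation, and the factor itself stores integer codes into those levels. Ordering by those codes therefore reproduces the locale-aware order without comparing any strings. A small base R sketch (the variable names f and by_codes are illustrative, and the exact order printed depends on the session locale):

```r
words <- c("casa", "árbol", "zona", "órgano")

f <- factor(words)
levels(f)      # levels are sort(unique(words)), i.e. locale-aware
as.integer(f)  # integer codes pointing into levels(f)

# Ordering by the integer codes gives the same result as sort(words)
by_codes <- as.character(f)[order(as.integer(f))]
by_codes
```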

@mczapanskiy-usgs mczapanskiy-usgs commented Jul 30, 2015

That's where it gets interesting. The dplyr vignette says arrange is a wrapper for order, but the following shows that the two functions give different results for both data.frame and data_frame:

library("dplyr")
words <- c("casa", "árbol", "zona", "órgano")
df1 <- data.frame(words, stringsAsFactors = FALSE)
df1[order(df1$words), ]
# [1] "árbol"  "casa"   "órgano" "zona"  
df1 %>% arrange(words)
#    words
# 1   casa
# 2   zona
# 3  árbol
# 4 órgano
df2 <- data_frame(words)
df2[order(df2$words), ]
# Source: local data frame [4 x 1]
# 
#    words
# 1  árbol
# 2   casa
# 3 órgano
# 4   zona
arrange(df2, words)
# Source: local data frame [4 x 1]
# 
#    words
# 1   casa
# 2   zona
# 3  árbol
# 4 órgano

It turns out that the reason most likely has to do with the R locale vs. the C locale. The following is from the dplyr documentation for arrange:

Locales
Note that for local data frames, the ordering is done in C++ code which does not have access to the
local specific ordering usually done in R. This means that strings are ordered as if in the C locale.

I know we can use Sys.getlocale to get the R locale, but I don't know how to check what C is using.
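
A rough way to see both orderings side by side in base R, without dplyr. Sys.getlocale("LC_COLLATE") reports the collation that sort() and order() follow by default. For comparison, the base R documentation states that method = "radix" always collates character vectors as in the C locale; this sketch assumes R 3.3.0 or later, where that method is available for character vectors. The names c_sorted and locale_sorted are illustrative.

```r
words <- c("casa", "árbol", "zona", "órgano")

# Collation that sort() and order() use by default
Sys.getlocale("LC_COLLATE")

# Locale-aware ordering; the result depends on the value printed above
locale_sorted <- sort(words)
locale_sorted

# C-locale ordering: the accented words sort after all the ASCII-initial ones
c_sorted <- sort(words, method = "radix")
c_sorted
# [1] "casa"   "zona"   "árbol"  "órgano"
```

The C-locale result matches the arrange() output on the character column shown earlier in the thread, which is consistent with the documentation note quoted above.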

@romainfrancois romainfrancois self-assigned this Aug 2, 2015
@romainfrancois romainfrancois added this to the 0.5 milestone Aug 2, 2015
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018