dplyr::distinct doesn't match identical strings with different encoding #1179

Closed
cjyetman opened this issue May 29, 2015 · 1 comment

cjyetman commented May 29, 2015

This is very similar to #603 but with distinct instead of inner_join.

If a column contains otherwise identical strings with different encodings, distinct returns both rows as distinct. This is inconsistent with many other ways of checking for identical rows or strings in base R.

An error or warning when encodings are mixed, or converting all strings to UTF-8 beforehand, seems preferable to silently returning identical rows as distinct. Note: the comments on #603 suggest the fix there was to make inner_join stop with an error when the strings do not all share the same encoding. However, when I run the test example from the first comment of that issue, inner_join (dplyr_0.4.1) now seems to return the same result as merge, with no error or warning.

Example:

x <- c("Montréal", "Montréal")
Encoding(x[2]) <- ""
Encoding(x)  # [1] "UTF-8"   "unknown"
df <- data.frame(x = x, stringsAsFactors = FALSE)
unique(df)
identical(df[1, 1], df[2, 1])
factor(df[, 1])
df[1, 1] == df[2, 1]
dplyr::distinct(df)  # the only one that does not treat them as the same string
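
Until this is handled inside distinct, the "force everything to UTF-8 beforehand" idea mentioned above can be sketched in base R. This is only a rough workaround, not anything dplyr does itself: normalize_utf8 is a hypothetical helper name, and it assumes that strings marked "unknown" really are in the session's native encoding (true for the UTF-8 locale shown in the session info below, but not guaranteed elsewhere).

```r
# Hypothetical helper (not part of dplyr): re-encode every character
# column of a data frame as UTF-8 using base R's enc2utf8().
# Assumption: elements marked "unknown" are in the native encoding.
normalize_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# Same data as in the example, with the accent written as an escape
# so the script does not depend on the encoding of the source file.
x <- c("Montr\u00e9al", "Montr\u00e9al")
Encoding(x[2]) <- ""  # drop the UTF-8 mark on the second element
df <- data.frame(x = x, stringsAsFactors = FALSE)

df_utf8 <- normalize_utf8(df)
print(Encoding(df_utf8$x))

# Only run the dplyr step if the package is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::distinct(df_utf8))
}
```

Because the conversion touches only character columns, other column types pass through unchanged. In a non-UTF-8 locale the "unknown" element would be reinterpreted in that locale's encoding, so the result should be checked before relying on it.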

Just in case, here's my session info:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 magrittr_1.5    assertthat_0.1 
[4] parallel_3.2.0  DBI_0.3.1       tools_3.2.0    
[7] Rcpp_0.11.6

romainfrancois self-assigned this Jul 8, 2015
romainfrancois added this to the 0.5 milestone Jul 8, 2015

romainfrancois commented Jul 8, 2015

Now getting this:

> x <- c("Montréal", "Montréal")
> Encoding(x[2]) <- ""
> df <- data_frame(x = x)
> distinct(df)
Source: local data frame [1 x 1]

         x
1 Montréal

The price to pay for this was a call to rank on the character vector, which handles the differing encodings internally.

lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018