Strings with and without encoding are not matched when joining #1885

Closed
krlmlr opened this issue Jun 4, 2016 · 5 comments
Labels
feature a feature request or enhancement

@krlmlr
Member

krlmlr commented Jun 4, 2016

encoded <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

unencoded <- function(x) {
  Encoding(x) <- "unknown"
  x
}

dplyr::left_join(tibble::data_frame(a = encoded("ä")),
                 tibble::data_frame(a = unencoded("ä"), b = 1))
##       a     b
##   <chr> <dbl>
## 1     ä    NA

Expected: b == 1 in the result.

This can be mitigated by using r_match() instead of match() in the JoinStringStringVisitor, but I wonder if dplyr should warn instead.
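
For illustration, a minimal base-R sketch of why the two strings can fail to match: they hold the same bytes and differ only in their declared encoding. The variable names are hypothetical, and the sketch assumes that r_match() delegates to base R's match(); the result of the final call depends on the session locale, so no output is claimed for it.

```r
# "\u00e4" is "ä"; strings built from \u escapes are marked as UTF-8.
x_utf8 <- "\u00e4"

# Same bytes, but with the encoding mark removed ("unknown" = native).
x_unknown <- x_utf8
Encoding(x_unknown) <- "unknown"

# The declared encodings differ ...
Encoding(c(x_utf8, x_unknown))

# ... while the underlying bytes are identical.
identical(charToRaw(x_utf8), charToRaw(x_unknown))

# Base R's match() translates strings before comparing them, so in a
# UTF-8 locale it is expected to find the match (locale-dependent).
match(x_utf8, x_unknown)
```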

@hadley
Member

hadley commented Jun 4, 2016

Maybe tibble should force encoding to UTF-8? dplyr would still need to warn, but that would mitigate some of the hassle.

@krlmlr
Member Author

krlmlr commented Jun 4, 2016

In my case, the data came from a CSV file read using read.csv(); but readr already takes care of the encoding. If this is done by tibble, perhaps readr doesn't have to do it anymore.

match() is Rcpp::match(), and Rcpp seems to respect the declared encoding (RcppCore/Rcpp#189, RcppCore/Rcpp#466). To me, r_match() looks like a safe, if perhaps slower, alternative. Or we fix upstream.

@hadley
Member

hadley commented Jun 7, 2016

Would be good to fix upstream

@krlmlr
Member Author

krlmlr commented Nov 7, 2016

Doesn't look like an upstream fix will become available soon. We should just make sure that column data is always UTF-8.
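
As a user-side workaround in the same spirit, character columns can be re-encoded to UTF-8 before joining. The sketch below is only an illustration, not what dplyr or tibble do internally: to_utf8() is a hypothetical helper built on base R's enc2utf8(), and how strings without a declared encoding are converted depends on the session locale.

```r
# Hypothetical helper: re-encode every character column of a data
# frame to UTF-8 and leave all other columns untouched.
to_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

left <- data.frame(a = "\u00e4", stringsAsFactors = FALSE)
right <- data.frame(a = "\u00e4", b = 1, stringsAsFactors = FALSE)
Encoding(right$a) <- "unknown"  # simulate data read without a declared encoding

# Normalize both sides before joining (only runs if dplyr is installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::left_join(to_utf8(left), to_utf8(right), by = "a")
}
```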

@krlmlr
Member Author

krlmlr commented Nov 8, 2016

Related: #1950, column names with non-native encoding.

This issue was closed.
2 participants