Joining data frames with different internal character encodings (UTF-8 vs. latin1) fails #1513

Closed
huftis opened this issue Nov 5, 2015 · 4 comments

@huftis

huftis commented Nov 5, 2015

When the join columns of two data frames have the same name but different internal encodings, the dplyr join functions fail. Example:

library(dplyr)
load(file("http://huftis.org/nedlasting/r/dplyr-encoding-join-problem.Rdata"))

The two data frames can be joined by the ‘løpenummer’ column:

> str(d1)
Classes ‘tbl_df’ and 'data.frame':  6 obs. of  2 variables:
 $ løpenummer: int  1 2 3 4 5 6
 $ foo       : int  176 114 148 154 181 120
> str(d2)
Classes ‘tbl_df’ and 'data.frame':  6 obs. of  2 variables:
 $ løpenummer: int  4 5 6 7 8 9
 $ baz       : int  24 46 45 10 35 16

But actually trying to join them fails with an error message:

> d1 %>% left_join(d2, by="løpenummer")
Error: cannot join on columns 'løpenummer' x 'løpenummer': index out of bounds 

The left_join() function does recognise that the data frames can be joined by ‘løpenummer’, but still fails:

> d1 %>% left_join(d2)
Joining by: "løpenummer"   
Error: cannot join on columns 'l�penummer' x 'l�penummer': index out of bounds 

Note that the ‘ø’ character is different in this error message. (On my system it’s shown as the ‘unknown character’ glyph.)

The two ‘løpenummer’ variables do have the exact same name, as confirmed by:

> names(d1)==names(d2)
[1]  TRUE FALSE

But the character encoding is different:

> Encoding(names(d1))
[1] "UTF-8"   "unknown"
> Encoding(names(d2))
[1] "latin1"  "unknown"
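The mismatch can be reproduced without the downloaded .Rdata file. The following is a minimal base R sketch, not taken from the original data: the variable names `name_utf8` and `name_latin1` are illustrative, and the final `enc2utf8()` step is only a possible workaround on affected versions, not something confirmed in this thread.

```r
# The same column name, stored once as UTF-8 and once as latin1.
name_utf8 <- "l\u00f8penummer"      # \u escapes are always marked as UTF-8
name_latin1 <- "l\xf8penummer"      # raw latin1 byte for 'ø'
Encoding(name_latin1) <- "latin1"   # declare the encoding of those bytes

Encoding(name_utf8)
Encoding(name_latin1)

# `==` translates before comparing, so the names count as equal ...
name_utf8 == name_latin1

# ... even though the underlying bytes differ.
identical(charToRaw(name_utf8), charToRaw(name_latin1))

# Assumed workaround: re-encode the names to UTF-8 before joining,
# e.g. names(d2) <- enc2utf8(names(d2)).
name_fixed <- enc2utf8(name_latin1)
identical(charToRaw(name_utf8), charToRaw(name_fixed))
```

Code that compares the names byte by byte rather than through `==` would treat the two strings as different, which is consistent with the "index out of bounds" error above.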

I observe this bug on R 3.2.2 with the latest Git version (2015-11-05) of dplyr on a 64-bit Linux system (the problem is also present with the released 0.4.3 version on R 3.2.1 on a Windows system).

@krlmlr
Member

krlmlr commented Feb 18, 2016

Related: A different error in a similar setting (UTF-8 vs. unknown encoding), seen on Ubuntu 15.10 in UTF locale: http://rpubs.com/krlmlr/dplyr-encoding-left-join

@hadley hadley added the bug an unexpected problem or unintended behavior label Mar 1, 2016
@hadley hadley added this to the 0.5 milestone Mar 1, 2016
@hadley
Member

hadley commented Mar 1, 2016

Another one for you @romainfrancois - I know how you enjoy encoding :)

@romainfrancois
Member

I think this is fixed now. @huftis, can you confirm, please? I'm always somewhat tense about encodings.
@hadley, do you know how I could test this? IIRC, having "løpenummer" in a test file will cause some other issue on e.g. Windows.

@hadley
Member

hadley commented May 1, 2016

You can use Unicode escapes, which do work on Windows:

# stringi::stri_escape_unicode("løpenummer")
df <- tibble::data_frame(x = 1:3)
names(df) <- "l\u00f8penummer"
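Building on the escape idea, a regression check for this issue might look like the sketch below. It is not the test that was actually added to dplyr. The data values are copied from the str() output earlier in the thread, the placeholder column name `id` is hypothetical, and the latin1 name is built from a `\x` byte escape so that the source file stays pure ASCII.

```r
library(dplyr)

# Same column name in two encodings, written with ASCII-only escapes.
name_utf8 <- "l\u00f8penummer"
name_latin1 <- "l\xf8penummer"
Encoding(name_latin1) <- "latin1"

# Values taken from the str(d1) and str(d2) output in the report.
d1 <- data.frame(id = 1:6, foo = c(176L, 114L, 148L, 154L, 181L, 120L))
d2 <- data.frame(id = 4:9, baz = c(24L, 46L, 45L, 10L, 35L, 16L))
names(d1)[1] <- name_utf8     # join column marked UTF-8
names(d2)[1] <- name_latin1   # join column marked latin1

# On a fixed dplyr this should join instead of raising
# "index out of bounds".
res <- left_join(d1, d2, by = name_utf8)
res
```

If the fix holds, `res` should have six rows, with `baz` filled in only for the keys 4, 5 and 6 that occur in both data frames.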

sicarul pushed a commit to sicarul/dplyr that referenced this issue May 4, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018