Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use String instead of Symbol #2388

Merged
merged 25 commits into from Feb 21, 2017
Merged

Use String instead of Symbol #2388

merged 25 commits into from Feb 21, 2017

Conversation

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 28, 2017

Currently breaks tests.

Fixes #1950.

Fixes #2387.

@krlmlr krlmlr added this to the data frame 1 milestone Feb 10, 2017
@krlmlr krlmlr added this to the data frame 1 milestone Feb 10, 2017
@krlmlr krlmlr force-pushed the b-#1950-symbols branch from 3921662 to 502ca18 Feb 10, 2017
@krlmlr krlmlr changed the title WIP: Use String instead of Symbol Use String instead of Symbol Feb 13, 2017
@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 18, 2017

@lionel-: Would you mind reviewing this, especially the definition of the new SymbolString and SymbolVector classes? The purpose of these classes is to clearly indicate the strings that are supposed to occur in the context of a column name, or a list of column names. This should get rid of encoding problems with column names once and forever.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

I'm not sure this helps solving the problem. What we need is:

  • to convert symbols to the native encoding before transforming them to strings.
  • to tag strings as UTF-8 before returning them to R in case we reencode them.

This PR does not seem to address the second issue (but I don't know if that's currently a problem in dplyr). As for the first issue, it should be mostly resolved on the R side by using rlang's symbol constructors, and could be handled on the C++ side with a simple Symbol make_symbol(SEXP x) function.

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

Thanks. To me, it looks like make_symbol() is not sufficient, because it doesn't allow storing column names that cannot be represented in the native encoding (#1950). My approach allows this by avoiding conversion to Symbol altogether to the greatest extent possible. The R side should be doing the same, i.e., convert to character as soon as possible and avoid converting to name if possible. I'm happy to look at your WIP concerning the conversion to rlang if you want to share it.

I'm wrapping String and CharacterVector, because

  1. Rcpp's implementation currently doesn't handle encoding very well
  2. this clearly shows the intent of a variable that holds column name(s)

Tagging strings on input (during construction of SymbolVector and SymbolString) sounds like a good idea. This makes sure they are always tagged on output. I'd like to approach this after merging this PR.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

To me, it looks like make_symbol() is not sufficient, because it doesn't allow storing column names that cannot be represented in the native encoding (#1950).

ohh, I was assuming we could convert to native and back to UTF-8 but you're right that this is not the case. UTF-8 to native is potentially a lossy conversion. So the problem is harder: we should never convert UTF-8 strings to native, even when we convert them to symbols. But we do need a way to get back a string correctly tagged as UTF-8 when we convert the symbol back to a string...

Would it be reasonable to add an encoding attribute when we construct symbols? And as_character() would know how to deal with them, and your wrapper classes would use this info as well.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

Would it be reasonable to add an encoding attribute when we construct symbols?

This would change the symbol by reference, so not an option. Wrapping the symbol in a list does not feel right either, but we may not have any other option. Maybe we should try providing an upstream patch.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

This would change the symbol by reference, so not an option.

hmm, maybe not a bad thing, actually?

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

I don't mind the lossy conversion if we can retrieve the original information later on. I'd prefer to have symbols in the native encoding, because this is consistent with what R expects.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

but we can't retrieve the original information, since the conversion is lossy.

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

Didn't you suggest to store the original information as an UTF-8 encoded string in an attribute?

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

yes, and we shouldn't convert to native. R does not convert to native it just forgets the UTF-8 tag.

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

I'm reasonably certain we should convert to native, because R or the user might rely on it at some point. Not converting is calling for trouble -- imagine a user will assume that he can call eval() on a symbol we create.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

imagine a user will assume that he can call eval() on a symbol we create.

Maybe I've been too much focused on the latin1 case where most unicode characters will be incorrectly translated, whereas the iconv translation would be fine for most CKJ users? Presumably they got UTF-8 strings that are representable in their locale encoding. This problem is hairy :s

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 20, 2017

I wonder if we should convert to native, but save the original utf-8 string in the attribute?

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 20, 2017

Yes, I think this is the safest option. (Even if the original is not UTF-8, we should convert it to UTF-8.)

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 21, 2017

on hindsight I think we should leave the symbol alone. All reencoding options have deep issues:

  • As you mentioned, converting to UTF-8 is not ideal because a user might already have created symbols with the locale encoding in an environment (e.g. with a simple assignment). If we convert to UTF-8, the symbols won't correspond.

  • Converting to the native encoding will cause havoc for western latin1 users working with UTF-8 encoded data.

But I have found some workarounds to make R remember the tag. Will post a PR soon on rlang

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 21, 2017

We must convert to the native encoding (your option 2), otherwise R won't find the variable that corresponds to the symbol. This is true especially for UTF-8 encoded data. What havoc are you expecting for western latin1 users?

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 21, 2017

I would think this is the user responsibility to convert to native if she had imported UTF-8 data. We cannot make dplyr incompatible with UTF-8 datasets for the whole latin1 Windows population...

Anyway, I found workarounds. A sane one that works 80% of the time, and an insane that works 100% of the time.

I wonder if R will take encodings into account when it performs symbol lookup? I would assume so. If that is the case, our problem is solved.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 21, 2017

@krlmlr

To be more precise, we cannot in general coerce to native because this will create corrupted strings for latin1 users. On the other hand, when it's time to evaluate, it might be reasonable to fixup the expression before evaluation and convert all UTF-8 or latin1 encoded symbols to the native encoding. I could add a C function for doing this in rlang. But this also means that data added to the evaluation environments should also be converted to native symbols before evaluation.

On the whole, this seems like a lot of trouble for a weak payoff. Most of the time the UTF-8 strings in a non-UTF-8 environment are coming from a dataset. Most likely, symbol lookups will refer to variables in that dataset and not the environment. Does this problem really need fixing?

It might still make sense to fix this issue in base R. Tomas is going to fix the encoding tag issue in R soon. While he is working on that, would it make sense to suggest to him that symbol lookup should first convert any UTF-8 encoded symbols to native encoding? This should not be very costly in ordinary situations, and this might handle most problems.

Edit: The following was wrong by the way, R does not perform any conversion, hence my suggestions above.

I wonder if R will take encodings into account when it performs symbol lookup? I would assume so. If that is the case, our problem is solved.

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 21, 2017

In R, character encoding is transparent for most uses, except system API interactions (console, file system, ...) and language symbols. Once the data is in R, the user shouldn't be caring about encoding issues at all.

R converts all strings to native before coercing to symbol, please look up installTrChar() in the R sources. This is not exported, but Rf_install(Rf_translateChar(.)) is a failsafe alternative. Why should rlang behave any different?

CC @kalibera: Ideally, base R would use UTF-8 consistently to encode all symbols. But looking at the R sources and the comments there leaves little hope that we can change it without breaking existing code. (I suspect there might be code that relies on a particular encoding for installed symbol strings.) I'd be happy to discuss this with you at your convenience.

@lionel-
Copy link
Member

@lionel- lionel- commented Feb 21, 2017

R converts all strings to native before coercing to symbol

I don't think that's true. as.name() ultimately calls coerceToSymbol(), which calls installChar(). The latter just forgets about the tag and does not perform any conversion. See https://github.com/wch/r-source/blob/14ee5c17dfab4bed5bc138cc4ef70a4c1e9fb7b9/src/main/coerce.c#L394

Why should rlang behave any different?

In any case rlang and the tidyverse cannot work like this because they perform this roundtrip all the time: string → symbol → string. Converting to native is too risky because a lossy operation.

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Feb 21, 2017

Moving discussion to r-lib/rlang#34, because it doesn't seem to affect this PR anymore.

hadley
hadley approved these changes Feb 21, 2017
@krlmlr krlmlr merged commit 61f77bc into tidyverse:master Feb 21, 2017
1 check was pending
@krlmlr krlmlr deleted the b-#1950-symbols branch Feb 21, 2017
@lock
Copy link

@lock lock bot commented Jan 18, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants