Non-ascii column names in version 0.3 are duplicated #636

Closed
hmalmedal opened this issue Sep 26, 2014 · 7 comments

@hmalmedal hmalmedal commented Sep 26, 2014

When I run mutate_each with version 0.3, columns with non-ASCII names are duplicated.

data_frame(a = "1", å = "2") %>% mutate_each(funs(as.numeric)) %>% str
## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  3 variables:
##  $ a: num 1
##  $ å: chr "2"
##  $ å: num 2

@hadley hadley commented Sep 26, 2014

And did this happen with dplyr 0.2?

@hmalmedal hmalmedal commented Sep 27, 2014

No. Here's what happens with 0.2:

data.frame(a = "1", å = "2", stringsAsFactors = FALSE) %>%
  mutate_each(funs(as.numeric)) %>% str
## 'data.frame':    1 obs. of  2 variables:
##  $ a: num 1
##  $ å: num 2

@hadley hadley added the bug label Sep 27, 2014
@hadley hadley added this to the 0.3 milestone Sep 27, 2014

@hadley hadley commented Sep 27, 2014

@romainfrancois can you take a look please?

@romainfrancois romainfrancois commented Sep 30, 2014

This also happens with a direct mutate call:

> data_frame(a = "1", å = "2") %>% mutate( a = as.numeric(a), å = as.numeric(å) )
Source: local data frame [1 x 3]

  a å å
1 1 2 2

@romainfrancois romainfrancois commented Sep 30, 2014

Related to this:

> Encoding( "å" )
[1] "UTF-8"
> d <- data.frame( å = 2 )
> Encoding( names(d) )
[1] "unknown"

Not sure what to do.

@romainfrancois romainfrancois commented Sep 30, 2014

I've added some encoding handling to the NamedListAccumulator class. So now we get:

> data.frame(å = "2", stringsAsFactors = FALSE) %>% mutate_each( funs(as.numeric) )
Error: cannot compare two strings of different encodings: unknown/UTF-8

We still need a way to compare and hash R strings independently of their encoding.
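
To make that requirement concrete, here is a minimal stand-alone C++ sketch. It is not dplyr's NamedListAccumulator and it does not use R's C API: `Enc`, `Name`, `utf8_key` and `Accumulator` are hypothetical names invented for this illustration, and Latin-1 is only assumed as the "other" encoding. The idea shown is to normalise every name to UTF-8 before hashing or comparing it, so that one name seen under two encodings maps to one column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an R string: raw bytes plus an encoding mark.
enum class Enc { Latin1, UTF8 };
struct Name {
    std::string bytes;
    Enc enc;
};

// Canonical key: the UTF-8 form of the name, whatever its original encoding.
std::string utf8_key(const Name& n) {
    if (n.enc == Enc::UTF8) return n.bytes;
    std::string out;
    for (unsigned char c : n.bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {  // Latin-1 bytes 0x80..0xFF become two UTF-8 bytes
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical accumulator: overwrites a column when its name already exists.
class Accumulator {
public:
    void set(const Name& name, double value) {
        const std::string key = utf8_key(name);
        auto it = index_.find(key);
        if (it == index_.end()) {
            index_.emplace(key, values_.size());
            values_.push_back(value);
        } else {
            values_[it->second] = value;  // same name: overwrite, do not append
        }
    }
    std::size_t size() const { return values_.size(); }

private:
    std::unordered_map<std::string, std::size_t> index_;
    std::vector<double> values_;
};

void check_accumulator() {
    Accumulator acc;
    acc.set({"a", Enc::UTF8}, 1.0);
    acc.set({"\xE5", Enc::Latin1}, 2.0);    // "å" as one Latin-1 byte
    acc.set({"\xC3\xA5", Enc::UTF8}, 2.0);  // "å" as two UTF-8 bytes
    assert(acc.size() == 2);                // two columns, not three
}
```

Hashing the raw bytes instead of the normalised key would file the two spellings of "å" under different keys, which is the kind of duplication reported in this issue.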

@romainfrancois romainfrancois commented Sep 30, 2014

Ok. I was about to send this to R-devel:

Hello, 

Some warm-up motivation:

> Encoding( "å" )
[1] "UTF-8"
> d <- data.frame( å = 2 )
> Encoding( names(d) )
[1] "unknown"

I’m wondering how one would test whether two CHARSXP are equal, i.e. how to get this result:

> a <- "å"
> b <- names(d)
> a == b
[1] TRUE

… using the C API, and if possible only the part of the C API that packages are allowed to use.

I can’t rely on the CHARSXP cache:

> Rcpp::cppFunction( 'bool eqstring( SEXP a, SEXP b){ return STRING_ELT(a,0) == STRING_ELT(b,0) ; }')
> eqstring( a, b )
[1] FALSE

Those are two different CHARSXP objects:

> Rcpp::cppFunction( 'void showme(SEXP s){ Rprintf("p = <%p>\\n", STRING_ELT(s,0)) ; } ' )
> showme(a)
p = <0x7fc96cd76ca8>
> showme(b)
p = <0x7fc96cd76bb8>

So these are two different pointers, which makes sense given the different encodings. Fair enough.

Of course, I can wrap these two strings in two character vectors and call back into R to do the job, something like this:

// Compare two CHARSXP by wrapping each in a length-1 character vector
// and evaluating `sa == sb` in R.
bool compare( SEXP a, SEXP b){
   SEXP sa = PROTECT( Rf_allocVector(STRSXP, 1) ) ;
   SET_STRING_ELT(sa, 0, a) ;

   SEXP sb = PROTECT( Rf_allocVector(STRSXP, 1) ) ;
   SET_STRING_ELT(sb, 0, b) ;

   SEXP call = PROTECT( Rf_lang3( Rf_install("=="), sa, sb ) ) ;
   SEXP res  = PROTECT( Rf_eval( call, R_GlobalEnv ) ) ;

   // Test against TRUE so that an NA result is not treated as equal
   bool result = LOGICAL(res)[0] == TRUE ;
   UNPROTECT(4) ;
   return result ;
}

But that’s just a waste of everything. 

So the question is: how can I compare two CHARSXP using the C API?
Ok, let me rephrase the question. Why is Seql hidden?

Romain

However, looking at Seql:

/* this has NA_STRING = NA_STRING */
attribute_hidden
int Seql(SEXP a, SEXP b)
{
    /* The only case where pointer comparisons do not suffice is where
      we have two strings in different encodings (which must be
      non-ASCII strings). Note that one of the strings could be marked
      as unknown. */
    if (a == b) return 1;
    /* Leave this to compiler to optimize */
    if (IS_CACHED(a) && IS_CACHED(b) && ENC_KNOWN(a) == ENC_KNOWN(b))
        return 0;
    else {
        SEXP vmax = R_VStack;
        int result = !strcmp(translateCharUTF8(a), translateCharUTF8(b));
        R_VStack = vmax; /* discard any memory used by translateCharUTF8 */
        return result;
    }
}

It looks like R implements string comparison as follows:

  • first try a direct pointer comparison (leveraging the cache)
  • otherwise convert both strings to UTF-8 and compare the results

The second step uses translateCharUTF8, which is apparently not forbidden. But this comment in the source:

R_VStack = vmax; /* discard any memory used by translateCharUTF8 */

is worrying, especially because R_VStack is static in memory.c and therefore unusable.

Anyway, I'll put a workaround in place. Expect Utf8String and Utf8CharacterVector classes, or something with a less ugly name.
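
The two-step comparison read off Seql above can be sketched outside of R as follows. This is only an illustration under stated assumptions, not the workaround itself and not the planned Utf8String classes: `Mark`, `MarkedString`, `to_utf8` and `same_string` are hypothetical names, `to_utf8` merely plays the role of translateCharUTF8 for this toy type, Latin-1 is assumed as the second encoding, and the cached/same-encoding shortcut in Seql is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a CHARSXP: raw bytes plus an encoding mark.
enum class Mark { Latin1, UTF8 };
struct MarkedString {
    std::string bytes;
    Mark mark;
};

// Toy counterpart of translateCharUTF8: returns the UTF-8 form of the bytes.
std::string to_utf8(const MarkedString& s) {
    if (s.mark == Mark::UTF8) return s.bytes;
    std::string out;
    for (unsigned char c : s.bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {  // Latin-1 bytes 0x80..0xFF become two UTF-8 bytes
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

bool same_string(const MarkedString* a, const MarkedString* b) {
    if (a == b) return true;             // step 1: same object, nothing to do
    return to_utf8(*a) == to_utf8(*b);   // step 2: compare the UTF-8 forms
}

void check_same_string() {
    MarkedString latin1{"\xE5", Mark::Latin1};  // "å" as one Latin-1 byte
    MarkedString utf8{"\xC3\xA5", Mark::UTF8};  // "å" as two UTF-8 bytes
    MarkedString other{"a", Mark::UTF8};
    assert(same_string(&utf8, &utf8));       // identical pointers
    assert(same_string(&latin1, &utf8));     // different objects, same text
    assert(!same_string(&latin1, &other));   // different text
}
```

In this sketch the temporaries are ordinary std::string values that are released when they go out of scope, so there is nothing comparable to resetting R_VStack. Whether the same holds for memory used by translateCharUTF8 is exactly the open question raised above.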

romainfrancois added a commit that referenced this issue Sep 30, 2014
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018