New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joins are too strict about type comparison #228

Closed
hadley opened this Issue Jan 30, 2014 · 11 comments

Comments

Projects
None yet
4 participants
@hadley
Member

hadley commented Jan 30, 2014

In this case it would be nice if the integers were upclassed to numeric.

df <- data.frame(
  V1 = c(rep("a",5), rep("b",5)),
  V2 = rep(c(1:5), 2),
  V3 = c(101:110),
  stringsAsFactors = FALSE
)

match <- data.frame(
  V1 = c("a", "b"),
  V2 = c(3, 4),
  stringsAsFactors = FALSE
)

library(dplyr)
semi_join(df, match)
@hadley

This comment has been minimized.

Member

hadley commented Feb 19, 2014

Similarly, factors should be upclassed to strings.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 25, 2014

This means we need to upgrade join visitiors from:

template <int RTYPE> class JoinVisitorImpl

to

template <int LHS_RTYPE, int LHS_RTYPE> class JoinVisitorImpl

Probably tedious but not hard. The dispatcher (i.e. inline JoinVisitor* join_visitor) will need some more logic to accomodate all cases.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 25, 2014

In progress but I can now handle the case where one variable is REALSXP and the other is INTSXP or LGLSXP.

Next:

  • INTSXP / LGLSXP
  • factor / string and string / factor
@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 26, 2014

I think I'm covering most cases now:

  • integer or logical can be coerced to numeric
  • logical can be coerced to integer
  • factor can be converted to string.

I've implemented this without materializing when not necessary, e.g. when I have a factor and a character vector, I don't first convert the factor to a character vector. This is achieved by a richer set of classes that derive from JoinVisitor.

This probably needs some testing.

One question though, in the following case (your initial example):

df <- data.frame(
  V1 = c(rep("a",5), rep("b",5)),
  V2 = rep(c(1:5), 2),
  V3 = c(101:110),
  stringsAsFactors = FALSE
)

match <- data.frame(
  V1 = c("a", "b"),
  V2 = c(3, 4),
  stringsAsFactors = FALSE
)
res <- semi_join(df, match)

Would you expect res$V2 to be numeric or integer ? At the moment because of how this is implemented internally, we get integer, i.e. we train some data structures which does the right thing for hashing consistently integer and numeric, but when we have our indices, we just subset from df so we get the same type as in df.

For other joins (e.g. right_join), the subsetting is based on DataFrameJoinVisitor so we get the promoted classes.

It might be better if we always get the promoted class ?

romainfrancois added a commit that referenced this issue Mar 26, 2014

@hadley

This comment has been minimized.

Member

hadley commented Mar 26, 2014

For a semi-join, I'd expect that it acts like a filter, only affecting the rows, not the columns. So the current behaviour sounds right to me.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 31, 2014

This is now fixed.

@otsaw

This comment has been minimized.

otsaw commented Apr 1, 2014

NAs remain a problem. While NA is technically logical, it should be promotable to anything. At least I'm not used to using NA_character_ etc. in code. In your implementation, can you check for all(is.na(x)) or would you need to promote TRUE and FALSE as well?

> left_join(data.frame(x="foo"), data.frame(x=NA, y=TRUE))
Joining by: "x"
Error: Can't join on 'x' because of incompatible types (logical/factor)
> left_join(data.frame(x="foo", stringsAsFactors=FALSE), data.frame(x=NA, y=TRUE))
Joining by: "x"
Error: Can't join on 'x' because of incompatible types (logical/character)

For what it's worth, this is no big deal for me personally as I use a wrapper around left_join. Doing type promotions in R code before joining doesn't bring a big speed penalty, except in rbind (issue #321) where it requires a loop.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Apr 1, 2014

doing all(is.na(x)) for every logical vector is going to be expensive. I'd rather keep what we have and let the user explicitely call as.character in that case.

Unless we allow logical and character to be joined, I think the current behavior is fine.

@otsaw

This comment has been minimized.

otsaw commented Apr 1, 2014

I think it's worth considering to allow logical to be joined with factor and character. It's hardly useful outside the NA case, but it's also consistent with base R, so I doubt it would be a surprise or a problem for users.

> merge(data.frame(x=TRUE), data.frame(x="TRUE", y=1))
     x y
1 TRUE 1
@hadley

This comment has been minimized.

Member

hadley commented Apr 1, 2014

@otsaw I am strongly against automatically coercing logical to factor/character. Learning to use NA_character_ is a small price to pay for safer joins. This problem will also be alleviated once we have better ways to describe correct types during data input (e.g. https://github.com/romainfrancois/fastread)

@JackStat

This comment has been minimized.

JackStat commented May 12, 2014

I apologize in advance for not having a reproducible example but I could reproduce the error outside of my person data sets. I am having issues with this error.

> Sub<-
+   jkw2 %.%
+   left_join(wps2, by='ID')
Error: Can't join on 'ID' because of incompatible types (numeric/integer)

I thought maybe one of them was character but that is not the case

> is.numeric(wps2$ID)
[1] TRUE
> is.numeric(jkw2$ID)
[1] TRUE

I am also able to join other tables.

> Sub<-
+   jkw2 %.%
+ #   left_join(wps2, by='ID') %.%
+   left_join(FS, by='ID') %.%
+   left_join(infect, by='ID')
>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.