Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joins are too strict about type comparison #228

Closed
hadley opened this issue Jan 30, 2014 · 11 comments
Closed

Joins are too strict about type comparison #228

hadley opened this issue Jan 30, 2014 · 11 comments
Assignees
Labels
Milestone

Comments

@hadley
Copy link
Member

@hadley hadley commented Jan 30, 2014

In this case it would be nice if the integers were upclassed to numeric.

df <- data.frame(
  V1 = c(rep("a",5), rep("b",5)),
  V2 = rep(c(1:5), 2),
  V3 = c(101:110),
  stringsAsFactors = FALSE
)

match <- data.frame(
  V1 = c("a", "b"),
  V2 = c(3, 4),
  stringsAsFactors = FALSE
)

library(dplyr)
semi_join(df, match)
@hadley
Copy link
Member Author

@hadley hadley commented Feb 19, 2014

Similarly, factors should be upclassed to strings.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 25, 2014

This means we need to upgrade join visitiors from:

template <int RTYPE> class JoinVisitorImpl

to

template <int LHS_RTYPE, int LHS_RTYPE> class JoinVisitorImpl

Probably tedious but not hard. The dispatcher (i.e. inline JoinVisitor* join_visitor) will need some more logic to accomodate all cases.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 25, 2014

In progress but I can now handle the case where one variable is REALSXP and the other is INTSXP or LGLSXP.

Next:

  • INTSXP / LGLSXP
  • factor / string and string / factor
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 26, 2014

I think I'm covering most cases now:

  • integer or logical can be coerced to numeric
  • logical can be coerced to integer
  • factor can be converted to string.

I've implemented this without materializing when not necessary, e.g. when I have a factor and a character vector, I don't first convert the factor to a character vector. This is achieved by a richer set of classes that derive from JoinVisitor.

This probably needs some testing.

One question though, in the following case (your initial example):

df <- data.frame(
  V1 = c(rep("a",5), rep("b",5)),
  V2 = rep(c(1:5), 2),
  V3 = c(101:110),
  stringsAsFactors = FALSE
)

match <- data.frame(
  V1 = c("a", "b"),
  V2 = c(3, 4),
  stringsAsFactors = FALSE
)
res <- semi_join(df, match)

Would you expect res$V2 to be numeric or integer ? At the moment because of how this is implemented internally, we get integer, i.e. we train some data structures which does the right thing for hashing consistently integer and numeric, but when we have our indices, we just subset from df so we get the same type as in df.

For other joins (e.g. right_join), the subsetting is based on DataFrameJoinVisitor so we get the promoted classes.

It might be better if we always get the promoted class ?

romainfrancois added a commit that referenced this issue Mar 26, 2014
@hadley
Copy link
Member Author

@hadley hadley commented Mar 26, 2014

For a semi-join, I'd expect that it acts like a filter, only affecting the rows, not the columns. So the current behaviour sounds right to me.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 31, 2014

This is now fixed.

@otsaw
Copy link

@otsaw otsaw commented Apr 1, 2014

NAs remain a problem. While NA is technically logical, it should be promotable to anything. At least I'm not used to using NA_character_ etc. in code. In your implementation, can you check for all(is.na(x)) or would you need to promote TRUE and FALSE as well?

> left_join(data.frame(x="foo"), data.frame(x=NA, y=TRUE))
Joining by: "x"
Error: Can't join on 'x' because of incompatible types (logical/factor)
> left_join(data.frame(x="foo", stringsAsFactors=FALSE), data.frame(x=NA, y=TRUE))
Joining by: "x"
Error: Can't join on 'x' because of incompatible types (logical/character)

For what it's worth, this is no big deal for me personally as I use a wrapper around left_join. Doing type promotions in R code before joining doesn't bring a big speed penalty, except in rbind (issue #321) where it requires a loop.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Apr 1, 2014

doing all(is.na(x)) for every logical vector is going to be expensive. I'd rather keep what we have and let the user explicitely call as.character in that case.

Unless we allow logical and character to be joined, I think the current behavior is fine.

@otsaw
Copy link

@otsaw otsaw commented Apr 1, 2014

I think it's worth considering to allow logical to be joined with factor and character. It's hardly useful outside the NA case, but it's also consistent with base R, so I doubt it would be a surprise or a problem for users.

> merge(data.frame(x=TRUE), data.frame(x="TRUE", y=1))
     x y
1 TRUE 1
@hadley
Copy link
Member Author

@hadley hadley commented Apr 1, 2014

@otsaw I am strongly against automatically coercing logical to factor/character. Learning to use NA_character_ is a small price to pay for safer joins. This problem will also be alleviated once we have better ways to describe correct types during data input (e.g. https://github.com/romainfrancois/fastread)

@JackStat
Copy link

@JackStat JackStat commented May 12, 2014

I apologize in advance for not having a reproducible example but I could reproduce the error outside of my person data sets. I am having issues with this error.

> Sub<-
+   jkw2 %.%
+   left_join(wps2, by='ID')
Error: Can't join on 'ID' because of incompatible types (numeric/integer)

I thought maybe one of them was character but that is not the case

> is.numeric(wps2$ID)
[1] TRUE
> is.numeric(jkw2$ID)
[1] TRUE

I am also able to join other tables.

> Sub<-
+   jkw2 %.%
+ #   left_join(wps2, by='ID') %.%
+   left_join(FS, by='ID') %.%
+   left_join(infect, by='ID')
>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants