inner_join character and factor #455

Closed
kismsu opened this issue Jun 9, 2014 · 10 comments

@kismsu kismsu commented Jun 9, 2014

I've noticed that if you join on a column that is character in one table and factor in the other, you get unstable results: some records match, some do not. Should the function return an error, or at least a warning, that the columns have different types?
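
A minimal sketch of the kind of warning asked about above, written in base R. `warn_on_key_type_mismatch` is a hypothetical helper, not a dplyr function, and it only compares the first class of each key column. `stringsAsFactors = TRUE` is set explicitly because `data.frame()` no longer converts strings to factors by default since R 4.0.0.

```r
# Hypothetical helper (not part of dplyr): warn when a join key column
# has a different class in the two tables.
warn_on_key_type_mismatch <- function(x, y, by) {
  for (col in by) {
    cls_x <- class(x[[col]])[1]
    cls_y <- class(y[[col]])[1]
    if (!identical(cls_x, cls_y)) {
      warning(
        sprintf("Join column '%s' is %s in x but %s in y", col, cls_x, cls_y),
        call. = FALSE
      )
    }
  }
  invisible(NULL)
}

# id is a factor in foo and a character vector in bar
foo <- data.frame(id = c("a", "b"), var1 = "foo", stringsAsFactors = TRUE)
bar <- data.frame(id = c("a", "b"), var2 = "bar", stringsAsFactors = FALSE)

# Emits a warning because the classes of `id` differ
warn_on_key_type_mismatch(foo, bar, by = "id")
```

Calling the helper before a join makes the mismatch visible up front instead of leaving it to show up as missing rows.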

@romainfrancois romainfrancois (Member) commented Jun 11, 2014

Can you share a reproducible example, please?

@rickyars rickyars commented Jun 11, 2014

Possibly related to issue #450? Here's an example:

library(dplyr)

foo <- data.frame(id = letters, var1 = "foo", stringsAsFactors=FALSE)
bar <- data.frame(id = rep(letters, 2), var2 = "bar")

This doesn't work:

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")

However, using merge works just fine:

tmp3 <- merge(foo, bar, by="id")
tmp4 <- merge(bar, foo, by="id")

What's even weirder is what happens when you switch which table has the factor variable:

foo <- data.frame(id = letters, var1 = "foo")
bar <- data.frame(id = rep(letters, 2), var2 = "bar", stringsAsFactors=FALSE)

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")

@rickyars rickyars commented Jun 11, 2014

Here's an even smaller example:

foo <- data.frame(id = c("a", "b"), var1 = "foo")
bar <- data.frame(id = c("a", "b"), var2 = "bar", stringsAsFactors=FALSE)

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")

foo <- data.frame(id = c("a", "b"), var1 = "foo", stringsAsFactors=FALSE)
bar <- data.frame(id = c("a", "b"), var2 = "bar")

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")
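
Until the join handles mixed key types itself, one possible workaround is to convert factor key columns to character in both tables before joining. The sketch below uses a hypothetical helper, `keys_as_character`, which is not part of dplyr. `stringsAsFactors = TRUE` is spelled out because `data.frame()` no longer creates factors by default since R 4.0.0, and the dplyr call is guarded so the example also runs without dplyr installed.

```r
# Hypothetical helper (not part of dplyr): convert factor key columns
# to character so both sides of the join use the same type.
keys_as_character <- function(df, by) {
  for (col in by) {
    if (is.factor(df[[col]])) {
      df[[col]] <- as.character(df[[col]])
    }
  }
  df
}

# id is a factor in foo and a character vector in bar
foo <- data.frame(id = c("a", "b"), var1 = "foo", stringsAsFactors = TRUE)
bar <- data.frame(id = c("a", "b"), var2 = "bar", stringsAsFactors = FALSE)

foo_chr <- keys_as_character(foo, by = "id")
bar_chr <- keys_as_character(bar, by = "id")

# Both key columns are now character, so the join compares like with like.
joined <- merge(foo_chr, bar_chr, by = "id")

# With dplyr installed, the same inputs can be passed to inner_join().
if (requireNamespace("dplyr", quietly = TRUE)) {
  joined_dplyr <- dplyr::inner_join(foo_chr, bar_chr, by = "id")
}
```

Only the key columns named in `by` are converted; other factor columns, such as `var1` here, are left as they are.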

@romainfrancois romainfrancois self-assigned this Jun 11, 2014

@kismsu kismsu (Author) commented Jun 11, 2014

Yep, the same

@hadley hadley (Member) commented Sep 12, 2014

And here's a test

test_that("inner_join is symmetric (even when joining on character & factor)", {
  foo <- data_frame(id = c("a", "b"), var1 = factor("foo"))
  bar <- data_frame(id = c("a", "b"), var2 = "bar")

  tmp1 <- inner_join(foo, bar, by="id")
  tmp2 <- inner_join(bar, foo, by="id")

  expect_is(tmp1$var1, "character")
  expect_is(tmp2$var1, "character")
  expect_equal(names(tmp1), c("id", "var1", "var2"))
  expect_equal(names(tmp2), c("id", "var2", "var1"))

  expect_equal(tmp1, tmp2)
})

@romainfrancois romainfrancois (Member) commented Sep 16, 2014

I don't get it. I think it is perfectly normal that:

> str(tmp1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2 obs. of  3 variables:
 $ id  : chr  "a" "b"
 $ var1: Factor w/ 1 level "foo": 1 1
 $ var2: chr  "bar" "bar"
> str(tmp2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2 obs. of  3 variables:
 $ id  : chr  "a" "b"
 $ var2: chr  "bar" "bar"
 $ var1: Factor w/ 1 level "foo": 1 1

Perhaps you meant:

foo <- data_frame(id = factor(c("a", "b")), var1 = "foo")
bar <- data_frame(id = c("a", "b"), var2 = "bar")

which indeed gives something wrong:

> str(tmp1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2 obs. of  3 variables:
 $ id  : Factor w/ 2 levels "a","b": 1 2
 $ var1: chr  "foo" "foo"
 $ var2: chr  "bar" "bar"
> str(tmp2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  3 variables:
 $ id  : chr "a"
 $ var2: chr "bar"
 $ var1: chr "foo"

@hadley hadley (Member) commented Sep 16, 2014

@romainfrancois oh oops, yeah, I think I put factor() around the wrong variable.

@romainfrancois romainfrancois (Member) commented Sep 20, 2014

I think it's OK now, at least according to the test I put in place here:
https://github.com/hadley/dplyr/blob/master/tests/testthat/test-joins.r#L195

@spymark spymark commented Sep 22, 2014

Hi Romain, I think you have left a couple of data_frame() calls in place of data.frame(). They are in the test you added (lines 196-197).

@romainfrancois romainfrancois (Member) commented Sep 22, 2014

That is intended. data_frame is much nicer to use.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018