Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

joins match rows on NA across datasets #2033

Closed
garrettgman opened this issue Jul 21, 2016 · 11 comments
Closed

joins match rows on NA across datasets #2033

garrettgman opened this issue Jul 21, 2016 · 11 comments
Assignees
Labels
Milestone

Comments

@garrettgman
Copy link
Member

@garrettgman garrettgman commented Jul 21, 2016

Below, inner_join() matches two rows that each have an NA for the key. This seems to violate the intuition about NA's (and leads to an incorrect result).

This seems to happen for all of the joins.

library(dplyr)
​
songs <- data_frame(song = c("Do-Re-Mi", "A Spoonful of Sugar"), movie = c("The Sound of Music", NA))
songs
##                  song              movie
##                 <chr>              <chr>
## 1            Do-Re-Mi The Sound of Music
## 2 A Spoonful of Sugar               <NA>
​
singers <- data_frame(movie = c(NA, "The Sound of Music"), singer = c("Arnold Schwarzenegger", "Julie Andrews"))
singers
##                movie                singer
##                <chr>                 <chr>
## 1               <NA> Arnold Schwarzenegger
## 2 The Sound of Music         Julie Andrews
​
songs %>% inner_join(singers, by = "movie")
##                  song              movie                singer
##                 <chr>              <chr>                 <chr>
## 1            Do-Re-Mi The Sound of Music         Julie Andrews
## 2 A Spoonful of Sugar               <NA> Arnold Schwarzenegger
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Nov 7, 2016

Thanks. This is clearly something that SQL does differently:

memdb_frame(a=c(1, NA), b = 1:2) %>% left_join(memdb_frame(a = c(1, NA), c = 2:3))

@hadley: Can we safely change the NA behavior for joins?

@hadley
Copy link
Member

@hadley hadley commented Nov 7, 2016

Yeah, an NA should never match another NA.

I have a vague memory that this has come up in the past, and some issue may have yielded the current behaviour.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Nov 29, 2016

See also examples in #2272.

@krlmlr krlmlr self-assigned this Feb 10, 2017
@krlmlr krlmlr added this to the data frame 2 milestone Feb 20, 2017
@krlmlr krlmlr added this to the data frame 2 milestone Feb 20, 2017
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 8, 2017

This is fixable, but getting it right for all data types in all combinations will take some time.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 8, 2017

@hadley: Should NaN match another NaN? We have is.na(NaN) in R, but current dplyr code considers NA_real_ != NaN but NA_real_ == NA_real_ and NaN == NaN.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 8, 2017

Joins on NA strings were deliberate at some point (#892), and are tested explicitly. This might be a breaking change unless we introduce some kind of compatibility argument. I still think we shouldn't treat NA as equal for joining.

@hadley
Copy link
Member

@hadley hadley commented Mar 9, 2017

I'd say NaN shouldn't match anything (like NA), and the string issue must've been= confusion on my part.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 9, 2017

Joins are also used to test data frame equality. I'll add a template parameter to control this.

@hadley
Copy link
Member

@hadley hadley commented Mar 9, 2017

Ah, I bet that's why

krlmlr added a commit to krlmlr/dplyr that referenced this issue Mar 19, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Mar 20, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Mar 21, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Mar 21, 2017
@krlmlr krlmlr closed this in #2517 Mar 21, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Mar 21, 2017
@iangow
Copy link

@iangow iangow commented Mar 21, 2017

Yeah, an NA should never match another NA.

But SQL's EXCEPT behaves as though one is treating NA as matching NA in a join. See #2527. I guess the question is whether this kind of behaviour should be the result of some verb and, if so, whether anti_join (as opposed to, say, "except") is that verb.

@iangow
Copy link

@iangow iangow commented Mar 21, 2017

Oh. I now see that setdiff is the verb that produces the desired behaviour.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants