Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

JOIN by different column names #177

Closed
hs3180 opened this Issue Jan 10, 2014 · 11 comments

Comments

Projects
None yet
6 participants
Contributor

hs3180 commented Jan 10, 2014

use a list of vectors to describe by columns?

Owner

hadley commented Jan 10, 2014

Joining by different columns in the x and y tbls? I was thinking of using a named vector like join(x, y, by = c("x1" = "y1"))

Contributor

hs3180 commented Jan 22, 2014

How about conditions like "x1 > y1" ?

Owner

hadley commented Jan 22, 2014

One day, but it will need a more flexible syntax than existing join functions in dplyr.

This feature is really needed. Self joins (among other operations) get very weird, confusing, and inefficient without it...

Owner

hadley commented Feb 11, 2014

@deaneckles would you mind sending me some examples of self-joins that you've used? I'm trying to get my head around what the common operations are.

nacnudus commented Apr 6, 2014

Not sure if this is what you mean, but self-joining is common when a network is described by relational data. For example, a family.

Name   | Mother | Father
=======|========|=======
Philip | Teresa | Mathew
Mickey | Sophie | Philip
Sophie | Alexis | Steven

To find Sophie's son, you self-join on "mother" == "name". To find Mickey's parents, you self-join twice on "name" = "mother" or "name" = "father".

I find it more often in normalised data where, say, a court hearing has two columns litigant and respondent in relation to the same parties table. In this case, the columns must be renamed twice, once for each relation. You could argue that denormalization is what databases are for, not dplyr, but then you could also argue that dplyr is meant to save the data analyst from having to learn yet another SQL dialect.

@hadley hadley added the enhancement label Aug 1, 2014

@hadley hadley self-assigned this Aug 1, 2014

Owner

hadley commented Aug 1, 2014

Implemented for sql tables in 5fa3734

@hadley hadley assigned romainfrancois and unassigned hadley Aug 12, 2014

Owner

hadley commented Aug 12, 2014

@romainfrancois can you please take a look at implementing this in the internal join methods - if by is named, the columns for x come from the names.

Owner

hadley commented Sep 12, 2014

Done :D

@hadley hadley closed this Sep 12, 2014

I expected the following to work... what am I missing?

x <- data.frame(x1 = c(1, 3, 2))
y <- data.frame(y1 = c(1, 2, 3),
           y2 = c("foo", "foo", "bar"))
left_join(x, y, by = c("x1" = "y1"))
# Error: cannot join on columns 'x1' x 'y1'
Owner

hadley commented Sep 22, 2014

@kykuo could you please file this as a new bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment