Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding support for CROSS JOIN #2924

Closed
javierluraschi opened this issue Jun 28, 2017 · 6 comments
Closed

Consider adding support for CROSS JOIN #2924

javierluraschi opened this issue Jun 28, 2017 · 6 comments
Labels
feature a feature request or enhancement wip work in progress

Comments

@javierluraschi
Copy link
Contributor

CROSS JOIN is unsupported under full_join.

memdb

d1_mem <- memdb_frame(a = 1:5, .name = "d1")
d2_mem <- memdb_frame(b = 1:5, .name = "d2")
d1_mem %>% full_join(d2_mem)

Triggers:

Error: `by` required, because the data sources have no common variables

However,

tbl(src_memdb(), sql("SELECT * FROM d1 CROSS JOIN d2"))
# Source:   SQL [?? x 2]
# Database: sqlite 3.11.1 [:memory:]
       a `a:1`
   <int> <int>
 1     1     1
 2     1     2
 3     1     3
 4     1     4
 5     1     5
 6     2     1
 7     2     2
 8     2     3
 9     2     4
10     2     5
# ... with more rows

sparklyr

library(sparklyr)
library(dplyr)

spark_install(version = "2.1.0")
sc <- spark_connect(master = "local", version = "2.1.0")

d1_tbl <- copy_to(sc, data.frame(a = 1:5), name = "d1", overwrite = TRUE)
d2_tbl <- copy_to(sc, data.frame(b = 1:5), name = "d2", overwrite = TRUE)

d1_tbl %>% full_join(d2_tbl)

Triggers,

Error: `by` required, because the data sources have no common variables

See sparklyr/sparklyr#771

@ianmcook
Copy link
Contributor

Wouldn't it be better to add a new cross_join() two-table verb instead of implementing this through full_join()? An accidental Cartesian join from failure to properly specify join conditions can return huge amounts of data.

@javierluraschi
Copy link
Contributor Author

javierluraschi commented Jun 29, 2017

@ianmcook Yeah, I agree, implementing cross_join() would be a better approach. We could also add a better error in full_join() for this case:

"No overlapping variables to perform full_join(), use cross_join() instead."

However, if we go this route, I think we also need to implement cross_join for DataFrames; would be inconsistent to add a new function that is supported only in database backends.

Is someone willing/able/interested in implementing cross_join for DataFrames already? If not, I might try to give it a shot in a week or two.

@JohnMount
Copy link

JohnMount commented Jun 30, 2017

I like the idea of a cross_join() verb in addition to the strict full_join verb (it documents intent). Also couldn't dplyr have a nice "in terms of other verbs" default implementation such as the following? Then only backends with additional issues have to override the implementation, and we have the implementation in terms of already trusted verbs.

suppressPackageStartupMessages(library(dplyr))
a <- data.frame(x = 1:3)
b <- data.frame(y= 1:2)

full_join(mutate(a, FAKECOL=1), mutate(b, FAKECOL=1), by='FAKECOL') %>% 
  select(-FAKECOL)
#>   x y
#> 1 1 1
#> 2 1 2
#> 3 2 1
#> 4 2 2
#> 5 3 1
#> 6 3 2

Obviously there are a few more steps to pick a fake join condition column at run time. The method I have been using to generate names is:

setdiff(paste("FC", 
              seq_len(1 + length(c(colnames(a), colnames(b)))), 
              sep = '_'), 
        c(colnames(a), colnames(b)))[[1]]

@RoyalTS
Copy link

RoyalTS commented Jul 10, 2017

Part of more general #2240

@hadley hadley added database feature a feature request or enhancement labels Aug 23, 2017
@hadley hadley added the wip work in progress label Oct 23, 2017
hadley pushed a commit to tidyverse/dbplyr that referenced this issue Oct 26, 2017
CROSS JOIN will be generated whenever by = character()

Fixes tidyverse/dplyr#2924
@jarauh
Copy link

jarauh commented Dec 19, 2017

Will this also work in dplyr or only in dbplyr?

@batpigandme
Copy link
Contributor

This is just a translation of dplyr's full_join() for SQL queries as CROSS JOIN. You can see the changes here: tidyverse/dbplyr@3316463

@lock lock bot locked as resolved and limited conversation to collaborators Jun 17, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement wip work in progress
Projects
None yet
Development

No branches or pull requests

7 participants