Can't join across sources #219

Closed
hadley opened this issue Apr 3, 2018 · 6 comments
Labels
dbplyr 🔧 feature a feature request or enhancement

Comments

@hadley
Member

hadley commented Apr 3, 2018

library(dplyr, warn.conflicts = FALSE)
library(bigrquery)

con1 <- DBI::dbConnect(
  bigrquery::dbi_driver(),
  dataset = "noaa_gsod",
  project = "bigquery-public-data",
  billing = "887175176791"
)
con2 <- DBI::dbConnect(
  bigrquery::dbi_driver(),
  dataset = "samples",
  project = "bigquery-public-data",
  billing = "887175176791"
)
tbl1 <- tbl(con1, "gsod2008")
tbl2 <- tbl(con2, "gsod")

left_join(tbl1, tbl2, by = c("year" = "year", "mo" = "month", "da" = "day"))
#>  Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow) 
left_join(tbl2, tbl1, by = c("year" = "year", "month" = "mo", "day" = "da"))
#>  Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow) 

(Part of #101)

@hadley hadley added feature a feature request or enhancement dbplyr 🔧 labels Apr 3, 2018
@hadley
Member Author

hadley commented Apr 12, 2018

We need to adjust same_src() so that it is always TRUE for any two BigQuery connections, and then ensure that we always use the fully qualified table name.

Possibly related: tidyverse/dplyr#2993
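
To make "fully qualified table name" concrete: BigQuery addresses a table as project.dataset.table. The sketch below is illustration only; `qualify_table()` is a hypothetical helper, not a bigrquery or dbplyr function, and it only builds the string without touching any connection.

```r
# Hypothetical helper (NOT part of bigrquery or dbplyr): build a fully
# qualified BigQuery table name of the form "project.dataset.table".
qualify_table <- function(project, dataset, table) {
  parts <- c(project, dataset, table)
  # Require exactly three non-empty strings.
  stopifnot(is.character(parts), length(parts) == 3, all(nzchar(parts)))
  paste(parts, collapse = ".")
}

qualify_table("bigquery-public-data", "noaa_gsod", "gsod2008")
#> [1] "bigquery-public-data.noaa_gsod.gsod2008"
qualify_table("bigquery-public-data", "samples", "gsod")
#> [1] "bigquery-public-data.samples.gsod"
```

A name in this form identifies the table regardless of which project or dataset a connection defaults to.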

@hadley
Member Author

hadley commented Apr 17, 2018

A better approach, which works in the dev version: use a single connection and refer to each table by its fully qualified name.

library(dplyr, warn.conflicts = FALSE)
library(bigrquery)

con <- DBI::dbConnect(bigrquery::dbi_driver(), project = bq_test_project())
tbl1 <- tbl(con, "bigquery-public-data.noaa_gsod.gsod2008")
tbl2 <- tbl(con, "bigquery-public-data.samples.gsod")

left_join(tbl1, tbl2, by = c("year" = "year", "mo" = "month", "da" = "day"))
left_join(tbl2, tbl1, by = c("year" = "year", "month" = "mo", "day" = "da"))

@hadley hadley closed this as completed in e878568 Apr 18, 2018
@abalter

abalter commented Oct 2, 2020

Has this been resolved? I'm still getting the error. Is this an adequate reprex?

library(magrittr)
library(ggplot2)
library(bigrquery)
library(tidyverse)
library(dplyr, warn.conflicts = FALSE)
library(knitr)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

conn = DBI::dbConnect(
  bigrquery::bigquery(),
  project = "elite-magpie-257717",
  dataset = "TEST_DATASET",
  KeyFilePath = "google_service_key.json",
  OAuthMechanism = 0
)

df = data.frame(
  geo_id = c("US", "CA"),
  socialization = c("Obnoxious", "Polite")
)

DBI::dbWriteTable(
  conn = conn,
  name = "mytable",
  value = df,
  overwrite = TRUE
)
#> Using an auto-discovered, cached token.
#> To suppress this message, modify your code or options to clearly consent to the use of a cached token.
#> See gargle's "Non-interactive auth" vignette for more details:
#> https://gargle.r-lib.org/articles/non-interactive-auth.html
#> The bigrquery package is using a cached token for ariel.balter@gmail.com.

country_data = tbl(conn, "mytable")

covid_conn = DBI::dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "covid19_ecdc"
)

covid_data = 
  tbl(covid_conn, "covid_19_geographic_distribution_worldwide") %>%
  select(geo_id, pop_data_2019)

inner_join(country_data, covid_data, by="geo_id")
#> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow).

Created on 2020-10-02 by the reprex package (v0.3.0)

@hadley
Member Author

hadley commented Oct 2, 2020

@abalter You need to use the same connection object for both tables.

@abalter

abalter commented Oct 2, 2020

But those are different data sources. One is a Google public dataset; the other is mine.

@abalter

abalter commented Oct 4, 2020

Ok, I think I see what you mean. I can create a connection without specifying a dataset, and I can also access a table outside the project associated with the connection by using its fully qualified path. This was not obvious.

library(bigrquery)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = bq_test_project()
)


country_data  = tbl(con, "elite-magpie-257717.TEST_DATASET.mytable")
#> Using an auto-discovered, cached token.
#> To suppress this message, modify your code or options to clearly consent to the use of a cached token.
#> See gargle's "Non-interactive auth" vignette for more details:
#> https://gargle.r-lib.org/articles/non-interactive-auth.html
#> The bigrquery package is using a cached token for ariel.balter@gmail.com.

covid_data = 
  tbl(con, "bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide") %>%
  select(geo_id, pop_data_2019) %>%
  distinct()

inner_join(country_data, covid_data, by="geo_id") %>% head(10)
#> Warning: `...` is not empty.
#> 
#> We detected these problematic arguments:
#> * `needs_dots`
#> 
#> These dots only exist to allow future extensions and should be empty.
#> Did you misspecify an argument?
#> # Source:   lazy query [?? x 3]
#> # Database: BigQueryConnection
#>   socialization geo_id pop_data_2019
#>   <chr>         <chr>          <int>
#> 1 Obnoxious     US         329064917
#> 2 Polite        CA          37411038

Created on 2020-10-04 by the reprex package (v0.3.0)
