Can't join across sources #219

hadley · 2018-04-03T21:45:46Z

library(dplyr, warn.conflicts = FALSE)
library(bigrquery)

con1 <- DBI::dbConnect(
  bigrquery::dbi_driver(),
  dataset = "noaa_gsod",
  project = "bigquery-public-data",
  billing = "887175176791"
)
con2 <- DBI::dbConnect(
  bigrquery::dbi_driver(),
  dataset = "samples",
  project = "bigquery-public-data",
  billing = "887175176791"
)
tbl1 <- tbl(con1, "gsod2008")
tbl2 <- tbl(con2, "gsod")

left_join(tbl1, tbl2, by = c("year" = "year", "mo" = "month", "da" = "day"))
#>  Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow) 
left_join(tbl2, tbl1, by = c("year" = "year", "month" = "mo", "day" = "da"))
#>  Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)

(Part of #101)

hadley · 2018-04-12T23:47:16Z

Need to adjust same_src() so that it always returns TRUE for any two BigQuery connections, then ensure we always use the fully qualified table name.

Possibly related: tidyverse/dplyr#2993
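A minimal sketch of the idea described above, not the actual bigrquery change. It assumes a stand-in generic (`same_src_sketch()`, hypothetical) instead of dplyr's real `same_src()`, and toy S3 objects (`fake_con()`, hypothetical) instead of real DBI connections, so it runs in base R without credentials. `qualified_name()` is likewise a hypothetical helper that only illustrates why fully qualified names are needed once connections with different default datasets are treated as the same source.

```r
# Hypothetical stand-in for dplyr's same_src() generic, so the sketch runs
# without dplyr, dbplyr, or a BigQuery connection.
same_src_sketch <- function(x, y) UseMethod("same_src_sketch")

# Default behaviour: two sources match only when they are the same object.
same_src_sketch.default <- function(x, y) identical(x, y)

# Proposed behaviour: any two BigQuery connections count as the same source.
same_src_sketch.BigQueryConnection <- function(x, y) {
  inherits(y, "BigQueryConnection")
}

# Toy connection objects: plain lists with a class attribute, not real
# DBI connections (assumption for illustration only).
fake_con <- function(project, dataset) {
  structure(
    list(project = project, dataset = dataset),
    class = "BigQueryConnection"
  )
}

con1 <- fake_con("bigquery-public-data", "noaa_gsod")
con2 <- fake_con("bigquery-public-data", "samples")

# Different default datasets, but both are BigQuery connections.
same_src_sketch(con1, con2)

# A non-BigQuery object is still treated as a different source.
same_src_sketch(con1, list())

# Hypothetical helper: build the fully qualified project.dataset.table name,
# so a table reference stays unambiguous regardless of which connection
# ends up running the joined query.
qualified_name <- function(con, table) {
  paste(con$project, con$dataset, table, sep = ".")
}

qualified_name(con1, "gsod2008")
qualified_name(con2, "gsod")
```

With this kind of dispatch, the join no longer depends on the two tables sharing one connection object, which is why every table reference then has to carry its own project and dataset.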

hadley · 2018-04-17T22:55:12Z

Better approach, which works in the dev version: use a single connection and refer to each table by its fully qualified name.

library(dplyr, warn.conflicts = FALSE)
library(bigrquery)

con <- DBI::dbConnect(bigrquery::dbi_driver(), project = bq_test_project())
tbl1 <- tbl(con, "bigquery-public-data.noaa_gsod.gsod2008")
tbl2 <- tbl(con, "bigquery-public-data.samples.gsod")

left_join(tbl1, tbl2, by = c("year" = "year", "mo" = "month", "da" = "day"))
left_join(tbl2, tbl1, by = c("year" = "year", "month" = "mo", "day" = "da"))

abalter · 2020-10-02T17:08:01Z

Has this been resolved? I'm still getting the error. Is this an adequate reprex?

library(magrittr)
library(ggplot2)
library(bigrquery)
library(tidyverse)
library(dplyr, warn.conflicts = FALSE)
library(knitr)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

conn = DBI::dbConnect(
  bigrquery::bigquery(),
  project = "elite-magpie-257717",
  dataset = "TEST_DATASET",
  KeyFilePath = "google_service_key.json",
  OAuthMechanism = 0
)

df = data.frame(
  geo_id = c("US", "CA"),
  socialization = c("Obnoxious", "Polite")
)

DBI::dbWriteTable(
  conn = conn,
  name = "mytable",
  value = df,
  overwrite = TRUE
)
#> Using an auto-discovered, cached token.
#> To suppress this message, modify your code or options to clearly consent to the use of a cached token.
#> See gargle's "Non-interactive auth" vignette for more details:
#> https://gargle.r-lib.org/articles/non-interactive-auth.html
#> The bigrquery package is using a cached token for ariel.balter@gmail.com.

country_data = tbl(conn, "mytable")

covid_conn = DBI::dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "covid19_ecdc"
)

covid_data = 
  tbl(covid_conn, "covid_19_geographic_distribution_worldwide") %>%
  select(geo_id, pop_data_2019)

inner_join(country_data, covid_data, by="geo_id")
#> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow).

Created on 2020-10-02 by the reprex package (v0.3.0)

hadley · 2020-10-02T18:00:37Z

@abalter you need to use the same connection object

abalter · 2020-10-02T18:07:18Z

But those are different data sources. One is a BigQuery public dataset; the other is mine.

abalter · 2020-10-04T08:12:17Z

OK, I think I see what you mean. I can create a connection without specifying a dataset, and I can access a table outside the project associated with that connection by giving its fully qualified path. This was not obvious.

library(bigrquery)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = bq_test_project()
)


country_data  = tbl(con, "elite-magpie-257717.TEST_DATASET.mytable")
#> Using an auto-discovered, cached token.
#> To suppress this message, modify your code or options to clearly consent to the use of a cached token.
#> See gargle's "Non-interactive auth" vignette for more details:
#> https://gargle.r-lib.org/articles/non-interactive-auth.html
#> The bigrquery package is using a cached token for ariel.balter@gmail.com.

covid_data = 
  tbl(con, "bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide") %>%
  select(geo_id, pop_data_2019) %>%
  distinct()

inner_join(country_data, covid_data, by="geo_id") %>% head(10)
#> Warning: `...` is not empty.
#> 
#> We detected these problematic arguments:
#> * `needs_dots`
#> 
#> These dots only exist to allow future extensions and should be empty.
#> Did you misspecify an argument?
#> # Source:   lazy query [?? x 3]
#> # Database: BigQueryConnection
#>   socialization geo_id pop_data_2019
#>   <chr>         <chr>          <int>
#> 1 Obnoxious     US         329064917
#> 2 Polite        CA          37411038

Created on 2020-10-04 by the reprex package (v0.3.0)

hadley added the feature (a feature request or enhancement) and dbplyr 🔧 labels Apr 3, 2018

hadley closed this as completed in e878568 Apr 18, 2018
