Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle records and arrays #145

Closed
ArielSSchwartz opened this issue Mar 25, 2017 · 12 comments

Comments

@ArielSSchwartz
Copy link

commented Mar 25, 2017

Can extract_data be changed to support optionally returning structured JSON data instead of a data.frame? This is useful when the query results includes ARRAYs and STRUCTs.

@ArielSSchwartz ArielSSchwartz changed the title Return JSON structured data instead of a data_frame Return JSON structured data instead of a data.frame Mar 25, 2017

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 18, 2017

I think it should return a data frame with list columns in that scenario. Could you provide a minimal reprex of such data?

@hadley hadley added the reprex label Apr 18, 2017

@ArielSSchwartz

This comment has been minimized.

Copy link
Author

commented Apr 19, 2017

Here is an example query:

SELECT * FROM [bigquery-public-data:samples.github_nested] WHERE repository.owner IN ('hadley') LIMIT 100

Rather than having to flatten repeated fields (ARRAYS), I would like to capture the JSON output and load it into a structured list in R.

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 19, 2017

Thanks, that's useful. I think it should be trivial to return this data as a list-column.

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 20, 2017

Even with the query explorer, and this simpler query:

SELECT repository 
FROM [bigquery-public-data.samples.github_nested]
WHERE repository.owner IN ('hadley') LIMIT 100

I had to check "allow large results" and uncheck "flatten" results.

Do you have an example of this working elsewhere? (I just want to double check that it is really this hard; I thought it would be easier)

@hadley hadley added feature and removed reprex labels Apr 20, 2017

@ArielSSchwartz

This comment has been minimized.

Copy link
Author

commented Apr 20, 2017

Yes. This this the only way I know to get nested JSON results from BigQuery Web UI.
Using the command line tools you can try something like:
bq --format prettyjson query "SELECT repository.* FROM [bigquery-public-data.samples.github_nested] WHERE repository.owner IN ('hadley') LIMIT 100"

@craigcitro

This comment has been minimized.

Copy link
Collaborator

commented Apr 20, 2017

You can also use the UI -- use standard SQL, and click the button to view results as JSON.

The --format flag was, in fact, designed for precisely this use case, so kudos for finding it. 😁

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 6, 2018

Reprex with latest (in-dev) API:

tb <- bq_project_query(bq_test_project(), 
  "SELECT * FROM publicdata.samples.github_nested WHERE repository.owner IN ('hadley') LIMIT 100",
  use_legacy_sql = FALSE
)
df <- bq_table_download(tb)

It's a bit harder than I had ancticipated thanks to the extreme nesting of the json.

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 8, 2018

Focussing the query a little to get a smaller selection repeated and non-repeated records.

tb <- bq_project_query(bq_test_project(), 
  "SELECT repository, type, payload.member, payload.shas
    FROM publicdata.samples.github_nested 
    WHERE repository.owner IN ('hadley') 
    LIMIT 100",
  use_legacy_sql = FALSE
)
@zippeurfou

This comment has been minimized.

Copy link

commented Apr 12, 2018

Here is another example:

query <- "SELECT APPROX_QUANTILES(x, 4) AS output, 2 as t FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x"
mydf <- query_exec(query, project = project, use_legacy_sql = FALSE, max_page = Inf)
> mydf
  output  t
1      1 25
2     50 75
3    100  2

vs what it look like from the UI
image
and the json

[
  {
    "output": [
      "1",
      "25",
      "50",
      "75",
      "100"
    ],
    "t": "2"
  }
]

Doing the same query the way @hadley did trigger an error:

> tb <- bq_project_query(project,query,use_legacy_sql=FALSE)
Complete
Billed: 0 B
> df <- bq_table_download(tb)
Error in bq_tabledata_to_list(json) :
  embedded nul in string: '\001\0\0\0\001'

@hadley hadley changed the title Return JSON structured data instead of a data.frame Handle records and arrays Apr 12, 2018

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 13, 2018

Thanks @zippeurfou here's an even more minimal example for testing:

tb1 <- bq_project_query(bq_test_project(), "SELECT STRUCT(1 AS a, 'abc' AS b) as x")
bq_table_download(tb1)

tb2 <- bq_project_query(bq_test_project(), "SELECT GENERATE_ARRAY(1, 10) as x")
bq_table_download(tb2)

@hadley hadley closed this in #228 Apr 17, 2018

hadley added a commit that referenced this issue Apr 17, 2018

Merge pull request #228 from r-dbi/full-parse
Parse JSON in C++: this considerably improves performance and adds full support for arrays, records, and arrays of records.

Fixes #145
@bbhoss

This comment has been minimized.

Copy link

commented Feb 26, 2019

It seems that the schema generator doesn't support saving lists back to bigquery as a repeated field? Am I missing something or should I open a ticket?

@hadley

This comment has been minimized.

Copy link
Member

commented Feb 26, 2019

@bbhoss please file a new issue with simple reprex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.