Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle records and arrays #145

Closed
ArielSSchwartz opened this issue Mar 25, 2017 · 12 comments
Closed

Handle records and arrays #145

ArielSSchwartz opened this issue Mar 25, 2017 · 12 comments
Labels
feature a feature request or enhancement

Comments

@ArielSSchwartz
Copy link

Can extract_data be changed to support optionally returning structured JSON data instead of a data.frame? This is useful when the query results includes ARRAYs and STRUCTs.

@ArielSSchwartz ArielSSchwartz changed the title Return JSON structured data instead of a data_frame Return JSON structured data instead of a data.frame Mar 25, 2017
@hadley
Copy link
Member

hadley commented Apr 18, 2017

I think it should return a data frame with list columns in that scenario. Could you provide a minimal reprex of such data?

@hadley hadley added the reprex needs a minimal reproducible example label Apr 18, 2017
@ArielSSchwartz
Copy link
Author

Here is an example query:

SELECT * FROM [bigquery-public-data:samples.github_nested] WHERE repository.owner IN ('hadley') LIMIT 100

Rather than having to flatten repeated fields (ARRAYS), I would like to capture the JSON output and load it into a structured list in R.

@hadley
Copy link
Member

hadley commented Apr 19, 2017

Thanks, that's useful. I think it should be trivial to return this data as a list-column.

@hadley
Copy link
Member

hadley commented Apr 20, 2017

Even with the query explorer, and this simpler query:

SELECT repository 
FROM [bigquery-public-data.samples.github_nested]
WHERE repository.owner IN ('hadley') LIMIT 100

I had to check "allow large results" and uncheck "flatten" results.

Do you have an example of this working elsewhere? (I just want to double check that it is really this hard; I thought it would be easier)

@hadley hadley added feature a feature request or enhancement and removed reprex needs a minimal reproducible example labels Apr 20, 2017
@ArielSSchwartz
Copy link
Author

Yes. This this the only way I know to get nested JSON results from BigQuery Web UI.
Using the command line tools you can try something like:
bq --format prettyjson query "SELECT repository.* FROM [bigquery-public-data.samples.github_nested] WHERE repository.owner IN ('hadley') LIMIT 100"

@craigcitro
Copy link
Collaborator

You can also use the UI -- use standard SQL, and click the button to view results as JSON.

The --format flag was, in fact, designed for precisely this use case, so kudos for finding it. 😁

@hadley
Copy link
Member

hadley commented Apr 6, 2018

Reprex with latest (in-dev) API:

tb <- bq_project_query(bq_test_project(), 
  "SELECT * FROM publicdata.samples.github_nested WHERE repository.owner IN ('hadley') LIMIT 100",
  use_legacy_sql = FALSE
)
df <- bq_table_download(tb)

It's a bit harder than I had ancticipated thanks to the extreme nesting of the json.

@hadley
Copy link
Member

hadley commented Apr 8, 2018

Focussing the query a little to get a smaller selection repeated and non-repeated records.

tb <- bq_project_query(bq_test_project(), 
  "SELECT repository, type, payload.member, payload.shas
    FROM publicdata.samples.github_nested 
    WHERE repository.owner IN ('hadley') 
    LIMIT 100",
  use_legacy_sql = FALSE
)

@zippeurfou
Copy link

Here is another example:

query <- "SELECT APPROX_QUANTILES(x, 4) AS output, 2 as t FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x"
mydf <- query_exec(query, project = project, use_legacy_sql = FALSE, max_page = Inf)
> mydf
  output  t
1      1 25
2     50 75
3    100  2

vs what it look like from the UI
image
and the json

[
  {
    "output": [
      "1",
      "25",
      "50",
      "75",
      "100"
    ],
    "t": "2"
  }
]

Doing the same query the way @hadley did trigger an error:

> tb <- bq_project_query(project,query,use_legacy_sql=FALSE)
Complete
Billed: 0 B
> df <- bq_table_download(tb)
Error in bq_tabledata_to_list(json) :
  embedded nul in string: '\001\0\0\0\001'

@hadley hadley changed the title Return JSON structured data instead of a data.frame Handle records and arrays Apr 12, 2018
@hadley
Copy link
Member

hadley commented Apr 13, 2018

Thanks @zippeurfou here's an even more minimal example for testing:

tb1 <- bq_project_query(bq_test_project(), "SELECT STRUCT(1 AS a, 'abc' AS b) as x")
bq_table_download(tb1)

tb2 <- bq_project_query(bq_test_project(), "SELECT GENERATE_ARRAY(1, 10) as x")
bq_table_download(tb2)

hadley added a commit that referenced this issue Apr 17, 2018
Parse JSON in C++: this considerably improves performance and adds full support for arrays, records, and arrays of records.

Fixes #145
@bbhoss
Copy link

bbhoss commented Feb 26, 2019

It seems that the schema generator doesn't support saving lists back to bigquery as a repeated field? Am I missing something or should I open a ticket?

@hadley
Copy link
Member

hadley commented Feb 26, 2019

@bbhoss please file a new issue with simple reprex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants