
google bigquery implementation complete #123

Merged
merged 9 commits into scratchdata:main on Mar 20, 2024

Conversation

@mohanish2504 (Contributor) commented Mar 17, 2024

/claim #115
/closes #115

  • QueryCSV and QueryJSON
  • InsertFromNDJson: converts to CSV, uploads to GCS, and then loads into BigQuery (see the sketch after this list)
  • Added a GCS storage provider (GCS and S3 are nominally interoperable; tried the S3 client against GCS but it didn't work)
  • Optional GCS delete
  • Optional GCS prefix
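
For context, a minimal sketch of the upload-then-load flow, assuming the standard cloud.google.com/go/storage and cloud.google.com/go/bigquery clients. The helper name and wiring are illustrative rather than the PR's actual code, and it uses the NDJSON format that shows up in the logs later in this thread:

import (
	"context"
	"fmt"
	"io"
	"os"

	"cloud.google.com/go/bigquery"
	"cloud.google.com/go/storage"
)

// uploadAndLoad copies a local NDJSON file into a GCS bucket, then runs a
// LOAD DATA statement so BigQuery ingests it from GCS.
func uploadAndLoad(ctx context.Context, gcsClient *storage.Client, bqClient *bigquery.Client,
	bucket, object, localPath, table string) error {
	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()

	// Stream the file into the bucket; Close flushes and finalizes the object.
	w := gcsClient.Bucket(bucket).Object(object).NewWriter(ctx)
	if _, err := io.Copy(w, f); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}

	// Load the uploaded object into the destination table.
	sql := fmt.Sprintf(
		"LOAD DATA INTO %s FROM FILES (format = 'JSON', uris = ['gs://%s/%s'])",
		table, bucket, object)
	job, err := bqClient.Query(sql).Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	return status.Err()
}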

@poundifdef Some things to take care of:

  • Both query formats require the user to query with dataset.table; we may need to note this somewhere in the docs

Screenshots: (four screenshots attached, dated 2024-03-17)

@algora-pbc bot mentioned this pull request Mar 17, 2024
@mohanish2504 (Contributor, Author) commented Mar 17, 2024

@poundifdef this is ready for review

@poundifdef (Contributor) left a comment

Thanks for the quick work here. I've left some comments and questions.

Review comments on:
  • pkg/storage/blobstore/gcs/gcs.go
  • pkg/destinations/bigquery/bigquery.go
  • pkg/destinations/bigquery/insert.go
  • pkg/destinations/bigquery/query.go
[MOD] fixed download to io.Copy
[MOD] removed redundant fields
[MOD] no need for ordering columns in CSV
@mohanish2504 (Contributor, Author)

@poundifdef this is ready for review

@poundifdef (Contributor) left a comment

Thank you - I've tested this and it is almost ready to go. Just a few small changes.

Review comments on:
  • pkg/destinations/bigquery/bigquery.go
  • pkg/destinations/bigquery/insert.go
@mohanish2504 (Contributor, Author)

@poundifdef added the changes

@poundifdef (Contributor)

This branch does not work any longer.

curl -X POST 'http://localhost:8080/api/data/insert/my_dataset.t?api_key=bq' --data '{"x": "y"}'

Gives this error:

14:41:55 ERR insert.go:23 > CreateEmptyTable: failed to create Table error="googleapi: Error 404: Not found: Dataset scratch-data-410814:my_dataset was not found in location US, notFound" query="CREATE TABLE IF NOT EXISTS my_dataset.t (__row_id BIGINT)"
14:41:55 ERR workers.go:41 > Unable to process message error="googleapi: Error 404: Not found: Dataset scratch-data-410814:my_dataset was not found in location US, notFound" thread=0

You probably want to use this api, right? https://cloud.google.com/bigquery/docs/datasets#go
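
For reference, a minimal sketch of that dataset-creation call, assuming the cloud.google.com/go/bigquery client; the 409 already-exists handling is illustrative:

import (
	"context"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/googleapi"
)

// ensureDataset creates the dataset if it does not already exist, treating
// a 409 (already exists) response as success.
func ensureDataset(ctx context.Context, client *bigquery.Client, name string) error {
	err := client.Dataset(name).Create(ctx, &bigquery.DatasetMetadata{})
	if e, ok := err.(*googleapi.Error); ok && e.Code == 409 {
		return nil // dataset already exists; nothing to do
	}
	return err
}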

…stream to raw sql, [MOD] file delete support.
@mohanish2504 (Contributor, Author)

So split and create the dataset first?

@poundifdef (Contributor)

So split and create the dataset first?

Yes. The goal is for the user to do as little configuration as possible, so if we can automate creating a dataset then that is what I'd like to do.

@mohanish2504 (Contributor, Author) commented Mar 18, 2024

I will need a location parameter; take it from the user in config?

@poundifdef (Contributor)

I will need a location parameter; take it from the user in config?

Yes
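
A hedged sketch of how that might plumb through, extending the ensureDataset sketch above; the Settings struct and its field names are hypothetical, not the project's actual config:

// Settings is a hypothetical config struct carrying the user-supplied
// BigQuery location, e.g. "US" or "EU".
type Settings struct {
	Location string
}

// ensureDatasetIn creates the dataset in the configured location if it
// does not already exist.
func ensureDatasetIn(ctx context.Context, client *bigquery.Client, name string, s Settings) error {
	meta := &bigquery.DatasetMetadata{Location: s.Location}
	err := client.Dataset(name).Create(ctx, meta)
	if e, ok := err.(*googleapi.Error); ok && e.Code == 409 {
		return nil // dataset already exists
	}
	return err
}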

@mohanish2504 (Contributor, Author)

What about the query (GET) path? What if the dataset and table aren't there? Create them and return an empty result?
Most likely the user won't be querying such a table.

@poundifdef (Contributor)

What about the query (GET) path? What if the dataset and table aren't there? Create them and return an empty result? Most likely the user won't be querying such a table.

If a user tries to query and the dataset or table does not exist, then we should just return the error that BigQuery throws. We only automatically create tables when the user inserts data.
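
In code terms, that means the query path just propagates BigQuery's own error; a sketch with illustrative names:

// runQuery returns BigQuery's error untouched; only the insert path
// auto-creates datasets and tables.
func runQuery(ctx context.Context, client *bigquery.Client, sql string) (*bigquery.RowIterator, error) {
	it, err := client.Query(sql).Read(ctx)
	if err != nil {
		// e.g. "googleapi: Error 404: Not found: Dataset ..." is surfaced as-is.
		return nil, err
	}
	return it, nil
}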

@mohanish2504 (Contributor, Author)

What about the query (GET) path? What if the dataset and table aren't there? Create them and return an empty result? Most likely the user won't be querying such a table.

If a user tries to query and the dataset or table does not exist, then we should just return the error that BigQuery throws. We only automatically create tables when the user inserts data.

That is exactly the current behavior; pushing.

@mohanish2504 (Contributor, Author)

@poundifdef pushed the changes

@poundifdef (Contributor)

Thank you for your persistence. I still get an error when running the same test as above:

curl -X POST 'http://localhost:8080/api/data/insert/my_dataset.t?api_key=bq' --data '{"x": "y"}'

Here is the error:

17:02:36 TRC types.go:59 > column_type_counts={"__row_id":{"int":1},"x":{"string":1}}
17:02:36 TRC types.go:86 > column_types={"__row_id":"int","x":"string"}
17:02:37 INF insert.go:119 > Uploading file to GCS 
17:02:38 INF insert.go:125 > Uploaded file to GCS gcs_file=b/my_dataset.t2/1_my_dataset.t2_1769831789700034560.ndjson.ndjson
17:02:38 INF insert.go:127 > Streaming data to BigQuery
17:02:39 ERR insert.go:172 > StreamDataToBigQuery: failed to stream data to BigQuery error="googleapi: Error 400: Invalid schema update. Field x has changed type from STRING to BOOLEAN, invalid" query="LOAD DATA INTO my_dataset.t2 FROM FILES ( format = 'JSON', uris = ['gs://scratchdata-test-bucket/b/my_dataset.t2/1_my_dataset.t2_1769831789700034560.ndjson.ndjson'] )"
17:02:39 ERR insert.go:130 > Failed to stream data to BigQuery error="googleapi: Error 400: Invalid schema update. Field x has changed type from STRING to BOOLEAN, invalid"
17:02:39 ERR insert.go:184 > Failed to upload and stream data to BigQuery error="googleapi: Error 400: Invalid schema update. Field x has changed type from STRING to BOOLEAN, invalid" file=data/worker/1_my_dataset.t2_1769831789700034560.ndjson.ndjson table=my_dataset.t2
17:02:39 ERR workers.go:41 > Unable to process message error="googleapi: Error 400: Invalid schema update. Field x has changed type from STRING to BOOLEAN, invalid" thread=0

I believe this is because BigQuery's autodetect is converting the string "y" into a boolean before inserting.
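
One way around that, assuming the column types the pipeline already computes (the column_types map in the TRC lines above), is to spell the schema out in the LOAD DATA statement instead of relying on autodetect. A sketch; the toBQType mapping is illustrative:

import (
	"fmt"
	"sort"
	"strings"
)

// buildLoadSQL builds a LOAD DATA statement with an explicit column list so
// BigQuery does not autodetect types from the file contents.
func buildLoadSQL(table, uri string, columnTypes map[string]string) string {
	cols := make([]string, 0, len(columnTypes))
	for name, t := range columnTypes {
		cols = append(cols, fmt.Sprintf("%s %s", name, toBQType(t)))
	}
	sort.Strings(cols) // deterministic column order
	return fmt.Sprintf(
		"LOAD DATA INTO %s (%s) FROM FILES (format = 'JSON', uris = ['%s'])",
		table, strings.Join(cols, ", "), uri)
}

// toBQType maps the pipeline's illustrative type names to BigQuery types.
func toBQType(t string) string {
	switch t {
	case "int":
		return "INT64"
	case "float":
		return "FLOAT64"
	case "bool":
		return "BOOL"
	default:
		return "STRING"
	}
}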

@mohanish2504 (Contributor, Author)

You are right @poundifdef, that is indeed the autodetection issue.
Thanks for being patient.

@mohanish2504 (Contributor, Author)

@poundifdef got a chance to test this?

@poundifdef poundifdef merged commit a7ab34c into scratchdata:main Mar 20, 2024
Development

Successfully merging this pull request may close these issues: BigQuery Destination

2 participants