Skip to content

Add bigquery support#1002

Merged
Tarrasch merged 2 commits intospotify:masterfrom
vine:bq
Jun 18, 2015
Merged

Add bigquery support#1002
Tarrasch merged 2 commits intospotify:masterfrom
vine:bq

Conversation

@mikekap
Copy link
Contributor

@mikekap mikekap commented Jun 10, 2015

Building on the GCS support added in the previous commit, this adds support for loading datasets from GCS into
Bigquery & running queries to generate other tables.

There's no export function, but it would be trivial to add one (and I may in the very near future). The docs for
that are at https://cloud.google.com/bigquery/exporting-data-from-bigquery .

A word of warning - the tests take FOREVER to run (the simple load data test alone task 120s!). I'm not sure how to
make this better given there's no mocks for the BQ api (and moreover there's a ton of parameters that need checking).

Building on the GCS support added in the previous commit, this adds support for loading datasets from GCS into
Bigquery & running queries to generate other tables.

There's no export function, but it would be trivial to add one (and I may in the very near future). The docs for
that are at https://cloud.google.com/bigquery/exporting-data-from-bigquery .

A word of warning - the tests take FOREVER to run (the simple load data test alone task 120s!). I'm not sure how to
make this better given there's no mocks for the BQ api (and moreover there's a ton of parameters that need checking).
@Tarrasch
Copy link
Contributor

Cool! :D

Given how often a (project_id, dataset_id, table_id) triple is passed around all the time. Wouldn't it make more sense to make that into a NamedTuple?

You could create one Table specifier with BQTable(project_id='foo', dataset_id='bar', table_id='baz') then maybe. Code would then look like bq_client.exists(bqtable) instead of bq_client.exists(project_id='foo', dataset_id='bar', table_id='baz')

@mikekap
Copy link
Contributor Author

mikekap commented Jun 11, 2015

Hmm I see your point. On the other hand exists is a bad candidate for this - it doesn't accept a table, it accepts a table or a dataset. Quite a few methods are actually like that - e.g. list_datasets & list_tables.

The big eyesore is probably copy() with its 6 params. Maybe I could convert that to a task that takes 2 BigqueryTargets (which are effectively, the namedtuple you describe)? Do you think that's closer to what you're looking for or should I make something like a BQDataset & BQTable?

Separately, the internal version of this code actually just implemented a FileSystem interface over bigquery in the form of bq://project_id/dataset_id/table_id, which did end up using a namedtuple to parse the URL. Maybe something like that (although without implementing FileSystem) might be closer to what you're looking for? Alternatively I could make a tuple like BigqueryObject which might be a project, dataset, or table.

@Tarrasch
Copy link
Contributor

Hmm, yea, I think I stared over copy a bit too much. It's not true when I said that the triple is being passed around so much. As you pointed out, sometimes you omit the table_id and sometimes even the dataset_id.

Also, can you tell me more about the "FileSystem interface over bigquery in the form of bq://project_id/dataset_id/table_id". Are you thinking of resources in terms of their URI? I've been thinking about something similar for a while. Maybe we can collaborate on something?

The BigqueryObject idea is probably just going to add negative value I think, I don't really like a language to be untyped, lets not make it worse by ambiguous sum types. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this syntax valid in Python 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it turns out. The tests pass on python3 for me.

@sisidra
Copy link
Contributor

sisidra commented Jun 11, 2015

I vote for "BQDataset & BQTable". I think those are semantically distinct objects.
Also +1 for URIfication. :)

@Tarrasch
Copy link
Contributor

Just so we're on the same page. You intend that BQDataset is a NamedTuple('project_id dataset_id') meanwhile BQTable is a NamedTuple('project_id dataset_id table_id'). Right? (I mean something like that, need not necessarily to defined exactly like that)

@mikekap
Copy link
Contributor Author

mikekap commented Jun 12, 2015

Updated to use tuples (and fixed docstrings). I left BigqueryTarget as a non-tuple, since I'm assuming in most cases folks won't care about BQClient internals and just want to run a load/query job.

@mikekap
Copy link
Contributor Author

mikekap commented Jun 12, 2015

As for the URI approach - yea in our codebase we have a central "get (flag) target by uri" function that understands s3:// mock:// gs:// file:// and bq://. It comes in very useful when writing unit tests (e.g. to redirect output to mock://). Having some kind of system for doing that within luigi would be nice, but certainly it's not difficult to roll your own right now.

@mikekap
Copy link
Contributor Author

mikekap commented Jun 17, 2015

friendly ping :)

@Tarrasch
Copy link
Contributor

Oh super sorry! I'll try to look at this tomorrow!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't you could inherit like that. Nice trick!

Tarrasch added a commit that referenced this pull request Jun 18, 2015
Add bigquery support
@Tarrasch Tarrasch merged commit 9c4c47a into spotify:master Jun 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants