Add bigquery support by mikekap · Pull Request #1002 · spotify/luigi

mikekap · 2015-06-10T17:32:53Z

Building on the GCS support added in the previous commit, this adds support for loading datasets from GCS into
Bigquery & running queries to generate other tables.

There's no export function, but it would be trivial to add one (and I may in the very near future). The docs for
that are at https://cloud.google.com/bigquery/exporting-data-from-bigquery .

A word of warning - the tests take FOREVER to run (the simple load data test alone task 120s!). I'm not sure how to
make this better given there's no mocks for the BQ api (and moreover there's a ton of parameters that need checking).

Building on the GCS support added in the previous commit, this adds support for loading datasets from GCS into Bigquery & running queries to generate other tables. There's no export function, but it would be trivial to add one (and I may in the very near future). The docs for that are at https://cloud.google.com/bigquery/exporting-data-from-bigquery . A word of warning - the tests take FOREVER to run (the simple load data test alone task 120s!). I'm not sure how to make this better given there's no mocks for the BQ api (and moreover there's a ton of parameters that need checking).

Tarrasch · 2015-06-10T18:59:35Z

Cool! :D

Given how often a (project_id, dataset_id, table_id) triple is passed around all the time. Wouldn't it make more sense to make that into a NamedTuple?

You could create one Table specifier with BQTable(project_id='foo', dataset_id='bar', table_id='baz') then maybe. Code would then look like bq_client.exists(bqtable) instead of bq_client.exists(project_id='foo', dataset_id='bar', table_id='baz')

mikekap · 2015-06-11T04:04:03Z

Hmm I see your point. On the other hand exists is a bad candidate for this - it doesn't accept a table, it accepts a table or a dataset. Quite a few methods are actually like that - e.g. list_datasets & list_tables.

The big eyesore is probably copy() with its 6 params. Maybe I could convert that to a task that takes 2 BigqueryTargets (which are effectively, the namedtuple you describe)? Do you think that's closer to what you're looking for or should I make something like a BQDataset & BQTable?

Separately, the internal version of this code actually just implemented a FileSystem interface over bigquery in the form of bq://project_id/dataset_id/table_id, which did end up using a namedtuple to parse the URL. Maybe something like that (although without implementing FileSystem) might be closer to what you're looking for? Alternatively I could make a tuple like BigqueryObject which might be a project, dataset, or table.

Tarrasch · 2015-06-11T08:42:00Z

Hmm, yea, I think I stared over copy a bit too much. It's not true when I said that the triple is being passed around so much. As you pointed out, sometimes you omit the table_id and sometimes even the dataset_id.

Also, can you tell me more about the "FileSystem interface over bigquery in the form of bq://project_id/dataset_id/table_id". Are you thinking of resources in terms of their URI? I've been thinking about something similar for a while. Maybe we can collaborate on something?

The BigqueryObject idea is probably just going to add negative value I think, I don't really like a language to be untyped, lets not make it worse by ambiguous sum types. :)

Tarrasch · 2015-06-11T08:42:48Z

luigi/contrib/bigquery.py

Is this syntax valid in Python 3?

Yes, it turns out. The tests pass on python3 for me.

sisidra · 2015-06-11T09:31:29Z

I vote for "BQDataset & BQTable". I think those are semantically distinct objects.
Also +1 for URIfication. :)

Tarrasch · 2015-06-11T09:45:41Z

Just so we're on the same page. You intend that BQDataset is a NamedTuple('project_id dataset_id') meanwhile BQTable is a NamedTuple('project_id dataset_id table_id'). Right? (I mean something like that, need not necessarily to defined exactly like that)

mikekap · 2015-06-12T17:49:27Z

Updated to use tuples (and fixed docstrings). I left BigqueryTarget as a non-tuple, since I'm assuming in most cases folks won't care about BQClient internals and just want to run a load/query job.

mikekap · 2015-06-12T17:57:23Z

As for the URI approach - yea in our codebase we have a central "get (flag) target by uri" function that understands s3:// mock:// gs:// file:// and bq://. It comes in very useful when writing unit tests (e.g. to redirect output to mock://). Having some kind of system for doing that within luigi would be nice, but certainly it's not difficult to roll your own right now.

mikekap · 2015-06-17T18:33:30Z

friendly ping :)

Tarrasch · 2015-06-17T20:17:16Z

Oh super sorry! I'll try to look at this tomorrow!

Tarrasch · 2015-06-17T20:18:59Z

luigi/contrib/bigquery.py

Oh I didn't you could inherit like that. Nice trick!

Add bigquery support

Tarrasch reviewed Jun 11, 2015
View reviewed changes

Switch to named tuples instead of args for BQClient params.

63d1b3a

Tarrasch reviewed Jun 17, 2015
View reviewed changes

luigi/contrib/bigquery.py

Copy link

Contributor

Tarrasch Jun 17, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't you could inherit like that. Nice trick!

Tarrasch added a commit that referenced this pull request Jun 18, 2015

Merge pull request #1002 from vine/bq

9c4c47a

Add bigquery support

Tarrasch merged commit 9c4c47a into spotify:master Jun 18, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add bigquery support#1002

Add bigquery support#1002
Tarrasch merged 2 commits intospotify:masterfrom
vine:bq

mikekap commented Jun 10, 2015

Uh oh!

Tarrasch commented Jun 10, 2015

Uh oh!

mikekap commented Jun 11, 2015

Uh oh!

Tarrasch commented Jun 11, 2015

Uh oh!

Tarrasch Jun 11, 2015

Uh oh!

mikekap Jun 12, 2015

Uh oh!

sisidra commented Jun 11, 2015

Uh oh!

Tarrasch commented Jun 11, 2015

Uh oh!

mikekap commented Jun 12, 2015

Uh oh!

mikekap commented Jun 12, 2015

Uh oh!

mikekap commented Jun 17, 2015

Uh oh!

Tarrasch commented Jun 17, 2015

Uh oh!

Tarrasch Jun 17, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mikekap commented Jun 10, 2015

Uh oh!

Tarrasch commented Jun 10, 2015

Uh oh!

mikekap commented Jun 11, 2015

Uh oh!

Tarrasch commented Jun 11, 2015

Uh oh!

Tarrasch Jun 11, 2015

Choose a reason for hiding this comment

Uh oh!

mikekap Jun 12, 2015

Choose a reason for hiding this comment

Uh oh!

sisidra commented Jun 11, 2015

Uh oh!

Tarrasch commented Jun 11, 2015

Uh oh!

mikekap commented Jun 12, 2015

Uh oh!

mikekap commented Jun 12, 2015

Uh oh!

mikekap commented Jun 17, 2015

Uh oh!

Tarrasch commented Jun 17, 2015

Uh oh!

Tarrasch Jun 17, 2015

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants