bigDiffy comparison between natively partitioned BigQuery tables #483

Open
jrmcglynn opened this issue Sep 30, 2021 · 4 comments

@jrmcglynn
Contributor

Currently, bigDiffy does not seem to support comparing natively partitioned BigQuery tables. I ran bigDiffy with $<partition>-decorated BQ tables as both the lhs and the rhs inputs, but the $partition decorator was not retained in the arguments passed to the Dataflow job.

The job did not read any input records (according to the job graph) and failed after 15 minutes with NullPointerException messages. I'm not 100% sure whether the null pointers are related to the native partitioning.

FYI @catherinejelder

@jrmcglynn
Contributor Author

@catherinejelder my instinct is to make this work automatically for inputs with a $partition decorator, without requiring any additional arguments.

The updates should be incorporated here. I believe this method is not currently tested. Is that accurate, or am I missing a test somewhere? I just want to follow the existing approach if this method already has a test.

@jrmcglynn
Contributor Author

After going down the rabbit hole on the previous PR, I concluded that there were too many cases to cover to allow the user to simply feed in a $partition decorator on a natively partitioned table. We would need to determine what type of partitioning is implemented (date or integer), the size of the ranges (hourly, daily, monthly, some sort of integer range), and the field the table is partitioned on, before constructing a rowRestriction that selects the user's desired partition.
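To illustrate why the decorator alone is not enough, here is a rough Python sketch (not the bigDiffy code, which is not shown in this thread). The function names, the column names, and the exact filter syntax are assumptions made for illustration. The point is that the same `$partition` suffix needs a different filter depending on the partitioning type, and the column and range size have to come from table metadata.

```python
from datetime import datetime


def split_decorator(table_spec):
    """Split 'dataset.table$20210930' into ('dataset.table', '20210930').

    Returns None as the partition id when there is no '$' decorator.
    """
    table, sep, partition_id = table_spec.partition("$")
    return table, (partition_id if sep else None)


def daily_restriction(partition_id, column):
    """Filter for a table partitioned by day on a DATE column.

    Assumes a YYYYMMDD partition id; `column` must be looked up
    from the table definition, it is not part of the decorator.
    """
    day = datetime.strptime(partition_id, "%Y%m%d").date()
    return f"{column} = '{day.isoformat()}'"


def integer_range_restriction(partition_id, column, interval):
    """Filter for an integer-range partitioned table.

    Assumes the partition id is the start of the range; `interval`
    (the range size) also has to come from the table definition.
    """
    start = int(partition_id)
    return f"{column} >= {start} AND {column} < {start + interval}"
```

Hourly, monthly, and yearly partitions would each need another builder, and an id such as `2021` could plausibly be read as either a yearly or an integer-range partition, so the caller cannot pick the right builder without first inspecting the table.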

Much simpler is #489, which adds an optional rowRestriction parameter to the CLI. A nice bonus is that users could use it in other ways as well, such as diffing only "US" data to reduce the amount of data processed.
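As a rough picture of that idea (a Python sketch, not the actual change in #489), an optional filter can be parsed once and applied to both sides of the diff. The flag spellings `--lhs`, `--rhs`, and `--rowRestriction`, the `read_configs` helper, and the `country` column are all hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical CLI; the flag names are illustrative only.
    parser = argparse.ArgumentParser(description="Diff two BigQuery tables")
    parser.add_argument("--lhs", required=True, help="left-hand table")
    parser.add_argument("--rhs", required=True, help="right-hand table")
    parser.add_argument("--rowRestriction", default=None,
                        help="optional filter applied to both tables")
    return parser


def read_configs(argv):
    """Return one read config per side, adding the filter only if given."""
    args = build_parser().parse_args(argv)
    configs = []
    for table in (args.lhs, args.rhs):
        config = {"table": table}
        if args.rowRestriction is not None:
            config["row_restriction"] = args.rowRestriction
        configs.append(config)
    return configs
```

With this shape, selecting a single partition and selecting only "US" rows are the same operation from the tool's point of view: the user supplies the filter, and the tool does not need to know how the table is partitioned.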

@catherinejelder
Contributor

rowRestriction and the Storage API seem very useful, thanks! I left a couple of comments and will probably ask another blizzard to take a look, since I don't work in this repo very often.

@jrmcglynn
Contributor Author

@idreeskhan can you link the merged PR and close this issue? I don't have permission to link the PR.
