Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support require partition filter #112

Open
dgitis opened this issue Jan 25, 2023 · 5 comments
Open

Support require partition filter #112

dgitis opened this issue Jan 25, 2023 · 5 comments

Comments

@dgitis
Copy link
Collaborator

dgitis commented Jan 25, 2023

BigQuery has a require partition filter option that can only be set when creating a partitioned table (unlike table expiration which can be edited in the BigQuery console). As a result, the easiest way to set this config is via dbt when it creates or full-refreshes the models.

I expect that anyone wanting to configure this, will want to configure it for all models. The multi-site branch related to #107 has a partition header macro. It would be best to implement this configuration there after it gets merged into main.

@adamribaudo-velir
Copy link
Collaborator

This is a noble goal, but every model in the package would need to respect this. We have so much work ahead to even get models like dim_ga4__sessions to utilize partitioning. My preference would be to close this issue and open issues related to the individual tasks that need to be done: Ex: Update dim_ga4__sessions to allow incremental build and use partitions, Update dim_ga4__user_ids to allow incremental build and use partitions, etc

Remember that any model that scans all partitions, by definition, does not use a partition filter. We have lots of those models. So I don't think it's as easy as setting a config variable.

@dgitis
Copy link
Collaborator Author

dgitis commented Jan 28, 2023

Manually adding this to every model is not what I intended by opening this issue. What we need to do to is to macro our partition header and where clauses.

Some recent updates to our data marts only support dynamic partitioning while our base partitions support both dynamic and static partitions. This is an oversight that would be fixed by having partition macros applies across all models and it would make adding partition filters trivial.

Also, I'm working on multisite right now which makes the advantages of templating things like partitioning seem more practical.

@dgitis
Copy link
Collaborator Author

dgitis commented Jan 28, 2023

The issue that I've come across is that moving incremental headers into a macro puts partitions_to_replace outside of the scope of the parent model resulting in static where clauses ending up with no condition.

where event_date_dt = ""

The fix as I see it is to set all partitions macros in the same file. One sets partitions_to_replace. The other two call partitions_to_replace and output the header and where clause code.

@dgitis
Copy link
Collaborator Author

dgitis commented Jan 28, 2023

Adding a require partition filter to this structure should be trivial.

@adamribaudo-velir
Copy link
Collaborator

Sorry, I'm not concerned about the mechanics of adding macros for where clauses and configs, I'm concerned about the logical implications of requiring partition filters.

Take for example, stg_ga4__sessions_traffic_sources.sql . If you impose a partition filter, you'll split any multi-day sessions and the first_value function will seek within the 2nd day, rather than the 1st day.

Or take stg_ga4__user_id_mapping. If it's made incremental, then we unintentially start introducing many-to-many relationships between user_pseudo_ids and user_ids.

I definitely want to work on each of these issues, but we have to be careful of the logical implications of adding where filters and incremental materializations. It could change the data in unexpected ways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants