[Python] Add caching for BigQuery table definitions #34135
- Fix field_type.upper() redundant call in beam_row_from_dict
- Fix temp dataset handling logic in BigQueryWrapper.__init__
- Add comprehensive test coverage for table definition caching
- Add proper thread safety with RLock
- Add documentation and comments

Fixes apache#34076
R: @stankiewicz Thanks for the review. I've addressed your feedback.
Could you please review the updated code? Please let me know if you'd like me to explain anything further.
Description
This PR addresses issue #34076 by implementing a caching mechanism for BigQuery table definitions in the `BigQueryWrapper` class. Currently, the `get_table()` method is called independently by each worker, which can lead to BigQuery quota issues for users. This implementation adds a cache with a configurable TTL to store table definitions and reuse them across worker instances.

Changes
- Adds `_table_cache` to the `BigQueryWrapper` class to store table definitions with timestamps.
- Adds a `set_table_definition_ttl` method to allow adjusting the cache TTL or disabling caching by setting TTL to 0.
- Updates the `get_table()` method to check the cache before making an API call, respecting the TTL setting.
- Adds a `clear_table_cache()` method to allow clearing either the entire cache or specific entries.

Benefits
Testing
Added unit tests that verify:
Additional Notes
- The cache is held in the `BigQueryWrapper` class.
- Cache keys have the form `"{project_id}:{dataset_id}.{table_id}"`.
- Uses `threading.RLock` to handle concurrent access to the cache.

Fixes #34076
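
For reviewers who want the behaviour at a glance, here is a minimal, standalone sketch of the caching pattern described above. It is not the code in this PR: the class name `TableDefinitionCache`, the `fetch` callable, the injectable `clock`, and the 3600-second default TTL are all hypothetical, chosen only for illustration. It shows the `"{project_id}:{dataset_id}.{table_id}"` key format, the TTL check (with 0 disabling caching), and `threading.RLock` guarding the cache.

```python
import threading
import time


class TableDefinitionCache:
    """Hypothetical stand-in for the caching described in this PR."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        # fetch(project_id, dataset_id, table_id) stands in for the real API call.
        self._fetch = fetch
        self._ttl = ttl_seconds      # assumed default; a TTL of 0 disables caching
        self._clock = clock          # injectable so tests can control time
        self._lock = threading.RLock()
        self._entries = {}           # key -> (table_definition, stored_at)

    @staticmethod
    def _key(project_id, dataset_id, table_id):
        return f"{project_id}:{dataset_id}.{table_id}"

    def set_ttl(self, ttl_seconds):
        with self._lock:
            self._ttl = ttl_seconds

    def get_table(self, project_id, dataset_id, table_id):
        key = self._key(project_id, dataset_id, table_id)
        with self._lock:
            if self._ttl > 0:
                hit = self._entries.get(key)
                # Serve from the cache only while the entry is younger than the TTL.
                if hit is not None and self._clock() - hit[1] < self._ttl:
                    return hit[0]
            table = self._fetch(project_id, dataset_id, table_id)
            if self._ttl > 0:
                self._entries[key] = (table, self._clock())
            return table

    def clear(self, project_id=None, dataset_id=None, table_id=None):
        with self._lock:
            if project_id is None:
                self._entries.clear()    # no arguments: drop everything
            else:
                self._entries.pop(
                    self._key(project_id, dataset_id, table_id), None)
```

In this sketch the lock is held while fetching, which serializes lookups but avoids duplicate API calls for the same table from concurrent threads; whether that trade-off suits the real `BigQueryWrapper` is a separate question for review.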
Please Review
@apache/beam-maintainers