Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Add caching for BigQuery table definitions #34135

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Strikerrx01
Copy link

@Strikerrx01 Strikerrx01 commented Mar 1, 2025

[Python] Add caching for BigQuery table definitions

Description

This PR addresses issue #34076 by implementing a caching mechanism for BigQuery table definitions in the BigQueryWrapper class. Currently, the get_table() method is called independently by each worker, which can lead to BigQuery quota issues for users. This implementation adds a cache with configurable TTL to store table definitions and reuse them across worker instances.

Changes

  1. Added a dictionary _table_cache to the BigQueryWrapper class to store table definitions with timestamps.
  2. Implemented a configurable Time-To-Live (TTL) mechanism with a default of 1 hour (3600 seconds).
  3. Added the set_table_definition_ttl method to allow adjusting the cache TTL or disabling caching by setting TTL to 0.
  4. Modified the get_table() method to check the cache before making an API call, respecting the TTL setting.
  5. Added important fix to clear the cache when switching from disabled caching (TTL=0) to enabled caching (TTL>0) to prevent stale data.
  6. Added the clear_table_cache() method to allow selectively clearing the entire cache or specific entries.
  7. Added comprehensive unit tests to verify the caching behavior and TTL functionality.
  8. Added thread-safety with a lock to ensure proper cache handling in multi-threaded environments.

Benefits

  • Reduces the number of BigQuery API calls for table definitions.
  • Mitigates quota issues for users with many workers accessing the same table.
  • Improves performance by avoiding redundant API calls.
  • Provides flexible control over caching behavior through configurable TTL.
  • Ensures cache consistency when changing TTL settings.
  • Thread-safe implementation for reliable operation in distributed environments.

Testing

Added unit tests that verify:

  • Cache is used for subsequent lookups of the same table.
  • Cache entries expire based on the TTL configuration.
  • Setting TTL to 0 disables caching.
  • Changing TTL from 0 to a positive value clears the cache to ensure fresh data.
  • Tables can be selectively cleared from the cache.
  • The entire cache can be cleared when needed.

Additional Notes

  • The cache is maintained at the instance level of the BigQueryWrapper class.
  • The cache key is constructed using the format "{project_id}:{dataset_id}.{table_id}".
  • Thread-safety is ensured with a threading.RLock to handle concurrent access to the cache.
  • The default TTL is set to 3600 seconds (1 hour) but can be customized based on specific requirements.
  • When TTL is set to 0, caching is effectively disabled, and each call fetches fresh data from the API.

Fixes #34076

Please Review

@apache/beam-maintainers

Copy link
Contributor

github-actions bot commented Mar 1, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from b31af48 to cfff9ef Compare March 2, 2025 10:22
@Strikerrx01 Strikerrx01 closed this Mar 2, 2025
@Strikerrx01 Strikerrx01 deleted the bigquery-table-cache branch March 2, 2025 10:24
@Strikerrx01 Strikerrx01 restored the bigquery-table-cache branch March 2, 2025 10:26
@Strikerrx01 Strikerrx01 reopened this Mar 2, 2025
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from cfff9ef to 6cc2629 Compare March 2, 2025 10:32
@liferoad liferoad requested a review from ahmedabu98 March 2, 2025 13:22
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch 2 times, most recently from f80bd52 to 890670a Compare March 2, 2025 16:39
- Fix field_type.upper() redundant call in beam_row_from_dict
- Fix temp dataset handling logic in BigQueryWrapper.__init__
- Add comprehensive test coverage for table definition caching
- Add proper thread safety with RLock
- Add documentation and comments

Fixes apache#34076
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from 890670a to 8c5d460 Compare March 4, 2025 11:07
@Strikerrx01
Copy link
Author

Strikerrx01 commented Mar 4, 2025

R: @stankiewicz Thanks for the review. I've addressed your feedback:

  1. Removed redundant field_type.upper() call since the type is already uppercase
  2. Fixed temp dataset handling logic to preserve original behavior with temp_dataset_id or (BigQueryWrapper.TEMP_DATASET + self._temporary_table_suffix)
  3. Added comprehensive test coverage for caching functionality in bigquery_tools_test.py

Could you please review the updated code in bigquery_tools.py and bigquery_tools_test.py? The new tests verify caching behavior, TTL validation, and thread safety.

Please let me know if you'd like me to explain anything further.

Copy link
Contributor

github-actions bot commented Mar 4, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@Strikerrx01 Strikerrx01 requested a review from stankiewicz March 4, 2025 12:09
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from c496455 to 7128d38 Compare March 9, 2025 05:38
@Strikerrx01 Strikerrx01 closed this Mar 9, 2025
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from 7128d38 to ac1c788 Compare March 9, 2025 05:48
@Strikerrx01 Strikerrx01 reopened this Mar 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Cache the BigQuery table definition instead of calling tables.get() from every worker
4 participants