feat: Add `bpd.options.compute.maximum_result_rows` option to limit client data download #1829
Conversation
This commit introduces a new compute option, `bigframes.pandas.options.compute.maximum_rows_downloaded`, that sets a limit on the maximum number of rows that can be downloaded to a client machine. When this option is set and a data-downloading operation (e.g., `to_pandas()`, `to_pandas_batches()`) attempts to download more rows than the configured limit, a `bigframes.exceptions.MaximumRowsDownloadedExceeded` exception is raised. This helps prevent out-of-memory (OOM) errors in shared execution environments by providing a mechanism to control how much data is downloaded to the client.

The limit is checked in both `DirectGbqExecutor` and `BigQueryCachingExecutor`. Unit tests verify the functionality, covering scenarios where the limit is unset, set but not exceeded, and set and exceeded, for various DataFrame operations. Documentation has been updated by ensuring the docstring for the new option in `ComputeOptions` is comprehensive for automatic generation.
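A minimal usage sketch based on the description above; the table name is a placeholder, and note that the option and exception follow the original commit text here (the PR title later renames the option to `maximum_result_rows`):

```python
import bigframes.pandas as bpd
import bigframes.exceptions

# Cap client-side downloads at 10,000 rows (option name per this commit;
# the PR title later renames it to maximum_result_rows).
bpd.options.compute.maximum_rows_downloaded = 10_000

df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")
try:
    pdf = df.to_pandas()  # raises if the result exceeds the limit
except bigframes.exceptions.MaximumRowsDownloadedExceeded:
    pdf = df.head(10_000).to_pandas()  # fall back to a bounded sample
```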
This commit refactors the row limit check logic in `DirectGbqExecutor` and `BigQueryCachingExecutor` to use a new shared helper function `check_row_limit` located in `bigframes.session.utils`. This change reduces code duplication and improves maintainability. The functionality remains the same as before the refactoring.
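A hypothetical sketch of what such a shared helper might look like; the actual signature and message in `bigframes.session.utils` may differ:

```python
from typing import Optional

import bigframes.exceptions as bfe


def check_row_limit(row_count: int, max_rows: Optional[int]) -> None:
    """Raise if a result would exceed the configured download limit.

    Hypothetical signature; the real helper in bigframes.session.utils
    may take different arguments.
    """
    if max_rows is not None and row_count > max_rows:
        raise bfe.MaximumRowsDownloadedExceeded(
            f"Result has {row_count} rows, which exceeds the "
            f"maximum_rows_downloaded limit of {max_rows}."
        )
```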
```diff
@@ -39,10 +39,11 @@
 import bigframes.core.schema as schemata
 import bigframes.core.tree_properties as tree_properties
 import bigframes.dtypes
 import bigframes.exceptions as bfe
+from bigframes.exceptions import MaximumRowsDownloadedExceeded, QueryComplexityError, format_message
```
Don't import Classes.
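Illustratively, the review asks for qualifying names through the existing module alias rather than importing the classes directly, e.g.:

```python
import bigframes.exceptions as bfe

# Qualify through the module alias instead of importing the names:
msg = bfe.format_message("Result exceeds maximum_rows_downloaded.")
raise bfe.MaximumRowsDownloadedExceeded(msg)
```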
Another maybe more general option would be to truncate the arrow iterator itself, throwing if iteration is attempted past max rows? Ideally we'd be able to add this functionality by only modifying ExecuteResult, and keep the executor itself from taking on another responsibility?
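A hypothetical sketch of that suggestion: wrap the Arrow batch iterator so it raises once the cumulative row count passes the limit, keeping the check inside the result object rather than the executors. Names and structure are illustrative, not the actual `ExecuteResult` code:

```python
from typing import Iterator, Optional

import pyarrow as pa

import bigframes.exceptions as bfe


def _limit_batches(
    batches: Iterator[pa.RecordBatch], max_rows: Optional[int]
) -> Iterator[pa.RecordBatch]:
    """Yield batches, raising if iteration passes max_rows (illustrative)."""
    total = 0
    for batch in batches:
        total += batch.num_rows
        if max_rows is not None and total > max_rows:
            raise bfe.MaximumRowsDownloadedExceeded(
                f"Download would exceed the configured limit of {max_rows} rows."
            )
        yield batch
```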
(Title changed: `bpd.options.compute.maximum_rows_downloaded` option to limit client data download → `bpd.options.compute.maximum_result_rows` option to limit client data download)
Great idea @TrevorBergeron! Pushed this change to ExecuteResult in the latest commit.