[Data] Remove default limit on to_pandas() (#37418) #37420

Merged: 3 commits, Jul 18, 2023
24 changes: 12 additions & 12 deletions python/ray/data/dataset.py
@@ -3806,12 +3806,11 @@ def to_spark(self, spark: "pyspark.sql.SparkSession") -> "pyspark.sql.DataFrame"
         )

     @ConsumptionAPI(pattern="Time complexity:")
-    def to_pandas(self, limit: int = 100000) -> "pandas.DataFrame":
-        """Convert this :class:`~ray.data.Dataset` into a single pandas DataFrame.
+    def to_pandas(self, limit: int = None) -> "pandas.DataFrame":
+        """Convert this :class:`~ray.data.Dataset` to a single pandas DataFrame.

-        This method errors if the number of rows exceeds the
-        provided ``limit``. You can use :meth:`.limit` on the dataset
-        beforehand to truncate the dataset manually.
+        This method errors if the number of rows exceeds the provided ``limit``.
+        To truncate the dataset beforehand, call :meth:`.limit`.

         Examples:
             >>> import ray
@@ -3825,24 +3824,25 @@ def to_pandas(self, limit: int = 100000) -> "pandas.DataFrame":
         Time complexity: O(dataset size)

         Args:
-            limit: The maximum number of records to return. An error is
-                raised if the dataset has more rows than this limit.
+            limit: The maximum number of rows to return. An error is
+                raised if the dataset has more rows than this limit. Defaults to
+                ``None``, which means no limit.

         Returns:
             A pandas DataFrame created from this dataset, containing a limited
-            number of records.
+            number of rows.

         Raises:
             ValueError: if the number of rows in the :class:`~ray.data.Dataset` exceeds
             ``limit``.
         """
         count = self.count()
-        if count > limit:
+        if limit is not None and count > limit:
             raise ValueError(
                 f"the dataset has more than the given limit of {limit} "
-                f"records: {count}. If you are sure that a DataFrame with "
-                f"{count} rows will fit in local memory, use "
-                f"ds.to_pandas(limit={count})."
+                f"rows: {count}. If you are sure that a DataFrame with "
+                f"{count} rows will fit in local memory, set ds.to_pandas(limit=None) "
+                "to disable limits."
             )
         blocks = self.get_internal_block_refs()
         output = DelegatingBlockBuilder()
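Reviewer note: the behavioural change in this hunk is the guard on `limit`. The sketch below reproduces that guard outside Ray so it can be run in isolation. `check_row_limit` is a hypothetical helper written for illustration, not part of the Ray API, and the error message is shortened. The `Optional[int]` annotation is also an assumption of this sketch; the merged signature is annotated `limit: int = None`. The `is not None` check matters because, in Python 3, evaluating `count > None` raises `TypeError`, so the old condition could not be kept with the new default.

```python
from typing import Optional


def check_row_limit(count: int, limit: Optional[int] = None) -> None:
    """Hypothetical helper (not part of Ray) mirroring the new guard in to_pandas()."""
    # limit=None means "no limit", so the comparison is skipped entirely.
    if limit is not None and count > limit:
        raise ValueError(
            f"the dataset has more than the given limit of {limit} "
            f"rows: {count}."
        )


# New default: no limit, so a large row count passes the guard.
check_row_limit(count=250_000)

# An explicit limit still raises when the row count exceeds it.
try:
    check_row_limit(count=250_000, limit=100_000)
except ValueError as err:
    print(err)
```

Under the old default of 100000, the first call would have raised. After this change, callers who want that protection have to pass `limit` explicitly or truncate with `.limit` before converting.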