[Data] Don't load Arrow PyExtensionType by default #45084

Merged 3 commits on May 6, 2024.
11 changes: 10 additions & 1 deletion python/ray/data/__init__.py
@@ -78,7 +78,16 @@
         # anything.
         pass
     else:
-        if parse_version(pyarrow_version) >= parse_version("14.0.1"):
+        from ray._private.ray_constants import env_bool
+
+        RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE = env_bool(
+            "RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE", False
+        )
+
+        if (
+            parse_version(pyarrow_version) >= parse_version("14.0.1")
+            and RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE
+        ):
             pa.PyExtensionType.set_auto_load(True)
         # Import these arrow extension types to ensure that they are registered.
         from ray.air.util.tensor_extensions.arrow import (  # noqa
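The gating in this hunk can be sketched without Ray or PyArrow installed. This is an illustration only, not Ray's implementation: `read_bool_flag` is a hypothetical stand-in for `env_bool` (whose parsing rules are not shown in this diff, so the accepted values below are an assumption), and `version_tuple` is a simplified stand-in for `parse_version`.

```python
import os


def read_bool_flag(name: str, default: bool) -> bool:
    """Hypothetical stand-in for Ray's `env_bool`; accepted values are assumed."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


def version_tuple(version: str) -> tuple:
    """Simplified stand-in for `parse_version`; handles plain 'X.Y.Z' only."""
    return tuple(int(part) for part in version.split("."))


def should_autoload(pyarrow_version: str) -> bool:
    """Auto-load only if PyArrow is new enough AND the user explicitly opted in."""
    opted_in = read_bool_flag("RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE", False)
    new_enough = version_tuple(pyarrow_version) >= version_tuple("14.0.1")
    return new_enough and opted_in
```

With the variable unset, `should_autoload` returns `False` for every PyArrow version, which mirrors the new opt-in default introduced by this change.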
26 changes: 26 additions & 0 deletions python/ray/data/datasource/parquet_datasource.py
@@ -158,6 +158,30 @@ def set_schema_pickled(self, schema_pickled: bytes):
         self.schema_pickled = schema_pickled


+def _check_for_legacy_tensor_type(schema):
+    """Check for the legacy tensor extension type and raise an error if found.
+
+    Ray Data uses an extension type to represent tensors in Arrow tables. Previously,
+    the extension type extended `PyExtensionType`. However, this base type can expose
+    users to arbitrary code execution. To prevent this, we don't load the type by
+    default.
+    """
+    import pyarrow as pa
+
+    for name, type in zip(schema.names, schema.types):
+        if isinstance(type, pa.UnknownExtensionType) and isinstance(
+            type, pa.PyExtensionType
+        ):
+            raise RuntimeError(
+                f"Ray Data couldn't infer the type of column '{name}'. This might mean "
+                "you're trying to read data written with an older version of Ray. "
+                "Reading data written with older versions of Ray might expose you to "
+                "arbitrary code execution. To try reading the data anyway, set "
+                "`RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on all nodes. "
+                "To learn more, see https://github.com/ray-project/ray/issues/41314."

Review comment on lines +178 to +181:

I find this message slightly confusing (or underselling the risk). It's not only data "written with older versions of Ray" that might expose the user. Reading any random Parquet file will expose the user once `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` is set and Ray has been imported.

+            )
+
+
 @PublicAPI
 class ParquetDatasource(Datasource):
     """Parquet datasource, for reading and writing Parquet files.
@@ -258,6 +282,8 @@ def __init__(
                 [schema.field(column) for column in columns], schema.metadata
             )

+        _check_for_legacy_tensor_type(schema)
+
         if _block_udf is not None:
             # Try to infer dataset schema by passing dummy table through UDF.
             dummy_table = schema.empty_table()
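The docstring of `_check_for_legacy_tensor_type` and the review comment both refer to arbitrary code execution. The underlying mechanism is pickle: the legacy `PyExtensionType` serialized its type information with pickle, so auto-loading it means unpickling bytes chosen by whoever wrote the file. The sketch below uses only the standard library to show that mechanism in isolation; `Payload` is an illustrative name, and `eval` on an arithmetic string is a harmless placeholder for whatever a malicious file would actually run.

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # The author of the bytes picks both the callable and its arguments.
        return (eval, ("6 * 7",))


# Bytes like these could sit in the metadata of a file from any source.
untrusted_bytes = pickle.dumps(Payload())

# Loading alone runs the chosen callable; no method is invoked on the result.
result = pickle.loads(untrusted_bytes)
```

Here `result` is the return value of the call made during unpickling rather than a `Payload` instance, which is why the risk applies to any file read with auto-loading enabled and not only to files written by older versions of Ray.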
Three binary files not shown.