[Data] Add performant way to read large tfrecord datasets (#42277)
The main motivation for this PR is that ray.data.read_tfrecords yields suboptimal performance when reading large datasets. This PR adds a default "fast" route for reading TFRecords that relies on the tfx-bsl decoder. When no tf_schema is provided, this approach also infers the schema by making a pass over the data to determine the cardinality of the feature lists.

Signed-off-by: Martin Bomio <martinbomio@spotify.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Martin <martinbomio@gmail.com>
Co-authored-by: Scott Lee <sjl@anyscale.com>
Co-authored-by: Cheng Su <scnju13@gmail.com>
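To make the schema-inference step concrete, here is a minimal sketch of how one pass over the data can decide whether each feature is a scalar or a list. It is not the code from this PR, which relies on the tfx-bsl decoder. The function name `infer_feature_cardinality` and the dict-of-lists record layout (loosely mirroring how a `tf.train.Example` stores every feature as a list of values) are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Union

# Hypothetical in-memory stand-in for a decoded record:
# every feature maps to a list of values, which may be empty.
Value = Union[bytes, float, int]
Record = Dict[str, List[Value]]


def infer_feature_cardinality(records: Iterable[Record]) -> Dict[str, str]:
    """Scan every record once and classify each feature as 'scalar' or 'list'.

    A feature counts as 'scalar' only if no record holds more than one
    value for it; otherwise it is treated as a variable-length 'list'.
    """
    max_len: Dict[str, int] = {}
    for record in records:
        for name, values in record.items():
            # Track the largest number of values seen for this feature.
            max_len[name] = max(max_len.get(name, 0), len(values))
    return {name: "scalar" if n <= 1 else "list" for name, n in max_len.items()}


records = [
    {"label": [1], "tokens": [3, 7, 9]},
    {"label": [0], "tokens": [4]},
    {"label": [1], "tokens": []},
]
schema = infer_feature_cardinality(records)
print(schema)  # {'label': 'scalar', 'tokens': 'list'}
```

The sketch also shows why inference costs a full pass: a feature that looks scalar in the first records can still turn out to be a list later, so passing an explicit tf_schema avoids the extra scan.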
1 parent fe554c1, commit 2c37909
Showing 7 changed files with 361 additions and 21 deletions.
@@ -0,0 +1,29 @@
# syntax=docker/dockerfile:1.3-labs

ARG DOCKER_IMAGE_BASE_BUILD=cr.ray.io/rayproject/oss-ci-base_ml
FROM $DOCKER_IMAGE_BASE_BUILD

ARG ARROW_VERSION=14.*
ARG ARROW_MONGO_VERSION=
ARG RAY_CI_JAVA_BUILD=

# Unset dind settings; we are using the host's docker daemon.
ENV DOCKER_TLS_CERTDIR=
ENV DOCKER_HOST=
ENV DOCKER_TLS_VERIFY=
ENV DOCKER_CERT_PATH=

SHELL ["/bin/bash", "-ice"]

COPY . .

RUN <<EOF
#!/bin/bash

ARROW_VERSION=$ARROW_VERSION ./ci/env/install-dependencies.sh
# We manually install tfx-bsl here. Adding the library via data- or
# test-requirements.txt files causes unresolvable dependency conflicts with pandas.

pip install -U tfx-bsl==1.14.0 crc32c==2.3

EOF
@@ -0,0 +1,14 @@
name: "datatfxbslbuild"
froms: ["cr.ray.io/rayproject/oss-ci-base_ml"]
dockerfile: ci/docker/data-tfxbsl.build.Dockerfile
srcs:
  - ci/env/install-dependencies.sh
  - python/requirements.txt
  - python/requirements_compiled.txt
  - python/requirements/test-requirements.txt
  - python/requirements/ml/dl-cpu-requirements.txt
  - python/requirements/ml/data-requirements.txt
build_args:
  - ARROW_VERSION=14.*
tags:
  - cr.ray.io/rayproject/datatfxbslbuild