[Data] Add performant way to read large tfrecord datasets #42277

Merged 40 commits on Feb 29, 2024

Commits on Jan 24, 2024

  1. feat: add performant way to read large tfrecord datasets

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Jan 24, 2024
    af7b677
  2. add tfx-bsl as a test dependency

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Jan 24, 2024
    f25fa2e
  3. address PR comments

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Jan 24, 2024
    14ed874
  4. properly enable/disable fast read on tests

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Jan 24, 2024
    3b8bf91
  5. resolve absolute path from relative

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Jan 24, 2024
    b303cac
  6. add tensorflow-io for s3 fs impl

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Jan 24, 2024
    ddfca72
  7. try adding tfx-bsl, cython in data-test-requirements

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 24, 2024
    c3beb21

Commits on Jan 25, 2024

  1. skip tfx-bsl install in data

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    4ac5221
  2. Apply suggestions from code review

    Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
    Signed-off-by: Martin <martinbomio@gmail.com>
    martinbomio and scottjlee committed Jan 25, 2024
    552c2e0
  3. new tfx-bsl build

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    4a007c6
  4. Merge branch 'martinbomio/fast-tfrecord-read' of https://github.com/martinbomio/ray into martinbomio/fast-tfrecord-read

    scottjlee committed Jan 25, 2024
    3c20737
  5. lint

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    bf061db
  6. fix missing build dependency

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    3d7bd33
  7. add datatfxbsl build

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    b8519a1
  8. remove workers arg

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    b6982a3
  9. worker config

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 25, 2024
    7934adc

Commits on Jan 26, 2024

  1. data target

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 26, 2024
    386be26
  2. try pinning pandas<2

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 26, 2024
    1d83dce
  3. pin pandas to pandas==1.5.3

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 26, 2024
    2c49d51
  4. comment

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 26, 2024
    953af46
  5. update tag

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 26, 2024
    c373724

Commits on Jan 29, 2024

  1. add tfxbsl dockerfile

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 29, 2024
    e19c1fc

Commits on Jan 30, 2024

  1. add crc32c

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 30, 2024
    fb6b290
  2. Merge branch 'master' into martinbomio/fast-tfrecord-read

    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee committed Jan 30, 2024
    7d7a08f

Commits on Feb 5, 2024

  1. rewrite unwrap single value function to use pyarrow

    This avoids issues with tensor array conversions. It also casts
    large_list to list to avoid issues with the to_tf function.
    
    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 5, 2024
    ca27c49

Commits on Feb 6, 2024

  1. Merge branch 'master' into martinbomio/fast-tfrecord-read

    Signed-off-by: Martin <martinbomio@gmail.com>
    martinbomio committed Feb 6, 2024
    aac5ec6

Commits on Feb 7, 2024

  1. Merge branch 'martinbomio/fast-tfrecord-read' of https://github.com/martinbomio/ray into martinbomio/fast-tfrecord-read

    scottjlee committed Feb 7, 2024
    7408c10

Commits on Feb 13, 2024

  1. cast large_list to list always on fast read

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 13, 2024
    030556c
  2. move casting to datasource

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 13, 2024
    681f753

Commits on Feb 14, 2024

  1. Merge branch 'master' into martinbomio/fast-tfrecord-read

    Signed-off-by: Martin <martinbomio@gmail.com>
    martinbomio committed Feb 14, 2024
    bca5bff

Commits on Feb 20, 2024

  1. clean up docstrings

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 20, 2024
    8cf1c2d

Commits on Feb 26, 2024

  1. rename fast_* variables to tfx_

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 26, 2024
    9aa260b
  2. fix failing tests

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 26, 2024
    befc187
  3. add flag in data context to disable using tfx read

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 26, 2024
    413c1f0

Commits on Feb 27, 2024

  1. disable tfx_read by default

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 27, 2024
    1e2c627

Commits on Feb 28, 2024

  1. add TFXReadOptions

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 28, 2024
    9894178
  2. bf06a37
  3. fix build

    Signed-off-by: Martin Bomio <martinbomio@spotify.com>
    martinbomio committed Feb 28, 2024
    bf64415
  4. 663c39c

Commits on Feb 29, 2024

  1. 550392e