
[Datasets] Add basic e2e Datasets example on NYC taxi dataset #24874

Merged: 2 commits merged into ray-project:master on May 19, 2022

Conversation

@clarkzinzow (Contributor) commented May 17, 2022

This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset.

The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and dummy batch inference, for tabular (Parquet) data. The example doesn't use Ray AIR since AIR is still alpha, and doesn't (yet) use Ray Train in order to obviate the need for:

  1. performing real feature engineering, which would be overly advanced for this example,
  2. defining a model that works with the NYC taxi data, which would be non-trivial.
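
In rough strokes, the example walks through something like the sketch below (the S3 path and column name are assumptions on my part; the rendered notebook is the source of truth):

```python
import ray

# Read the example's Parquet data (assumed path into the public
# NYC taxi trip-record data).
ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/01/data.parquet")

# Basic inspection.
print(ds.schema())
print(ds.count())

# A simple Pandas-UDF transformation: drop non-positive trip distances.
ds = ds.map_batches(lambda df: df[df["trip_distance"] > 0], batch_format="pandas")

# Global shuffle, then "ingest" into a dummy trainer / batch inferrer.
ds = ds.random_shuffle()
for batch in ds.iter_batches(batch_format="pandas"):
    pass  # a real trainer or inferrer would consume `batch` here
```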

We can either complete the port to Ray Train using a dummy linear model in this example, or do that as a more targeted ML ingest example. I'm currently planning on the latter, but I'm happy to discuss the former!

Docs preview: https://ray--24874.org.readthedocs.build/en/24874/data/examples/index.html

TODOs

  • (Maybe?) Port to a linear Torch model and do full Ray Train + model inference

@clarkzinzow (Contributor, Author)

cc @maxpumperla for the examples page and general structure.

@ericl (Contributor) commented May 17, 2022

I enjoyed reading through this one. A couple operational comments:

  • Seems a bunch of references aren't rendering properly, double check them?
  • I see a bunch of yellow warnings from requests, is there a way to hide those?

@clarkzinzow (Contributor, Author)

> Seems a bunch of references aren't rendering properly, double check them?

Hmm strange, they are all straightforward references. 🤔 I'll dig into that.

> I see a bunch of yellow warnings from requests, is there a way to hide those?

I'll try disabling warnings.

@ericl (Contributor) commented May 17, 2022

On the example data, could we also mention the ability to read an entire directory, and partitioning support? Specifying individual files doesn't seem like the common case.
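
For reference, a minimal sketch of the directory-read case (the path is illustrative):

```python
import ray

# Pointing read_parquet at a directory (or S3 prefix) reads every
# Parquet file under it, including partitioned layouts.
ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/")
```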

(Resolved review threads on doc/source/data/examples/index.rst and doc/source/data/examples/nyc_taxi_basic_processing.ipynb.)
@clarkzinzow (Contributor, Author) commented May 17, 2022

@pcmoritz Do you want to take a quick sign-off pass on this example?

https://ray--24874.org.readthedocs.build/en/24874/data/examples/nyc_taxi_basic_processing.html

This is supposed to be one of the "basic data processing" examples: per the examples plan, we're going to do one of these for each data type (tabular, text, imagery), and we're planning to wait on the e2e ML workload examples (with full feature engineering plus non-dummy trainers and non-dummy batch inferrers) until the AIR examples land.

@clarkzinzow (Contributor, Author)

@ericl @jianoaix Ping on data team final review/approval.

@ericl (Contributor) commented May 18, 2022

Can you delete the cell from the top of the notebook?

# flake8: noqa
import warnings
import os

# Suppress noisy requests warnings.
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

Maybe mark it as hidden, or delete it manually from the JSON.
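
One way to hide it rather than delete it, sketched with nbformat, assuming the docs pipeline honors the MyST-NB "remove-cell" tag:

```python
import nbformat

# Tag the warning-suppression cell so the docs build drops it from the
# rendered page (assumes MyST-NB-style "remove-cell" handling).
path = "doc/source/data/examples/nyc_taxi_basic_processing.ipynb"
nb = nbformat.read(path, as_version=4)
nb.cells[0].metadata["tags"] = ["remove-cell"]
nbformat.write(nb, path)
```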

There's also a bunch of ugly output like

2022-05-17 19:42:53,496	WARNING read_api.py:252 -- The number of blocks in this dataset (2) limits its parallelism to 2 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
(scheduler +12m43s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +13m18s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

That should be removed.
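
For the first warning, the message itself points at the fix; a sketch (the block count is illustrative):

```python
import ray

ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/01/data.parquet")

# The read produced only 2 blocks (one per file), capping parallelism;
# repartitioning to a higher count addresses the WARNING above.
ds = ds.repartition(100)
```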

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 18, 2022
@pcmoritz (Contributor)

Thanks a lot for putting these examples together -- they look great. I have the same comments as Eric about the verbose outputs and the cell at the top -- we should get rid of those.

Also I was wondering if and why the batch_format="pandas" part in some of the calls was needed.

The pictures for the datasets are a very nice touch!

For the second example it would be great to mention what the execution model for create_shuffle_pipeline is (am I correct in assuming that the DatasetPipeline is a lazy object/expression that gets executed when iter_batches is called?).
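
For reference, a minimal sketch of the execution model being asked about, assuming the DatasetPipeline API of this era: construction is lazy, and execution is driven window by window once a consumer iterates.

```python
import ray

ds = ray.data.range(1000)

# Building the pipeline is lazy: repeat() and the per-window shuffle
# only record the work to do for each window.
pipe = ds.repeat(2).random_shuffle_each_window()

# Execution happens here, window by window, as batches are pulled.
for batch in pipe.iter_batches():
    pass  # e.g. feed a trainer
```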

Ideally the second example would also use a real dataset and then e.g. train with XGBoost, but we can leave this as a follow-up item.

Also, are these examples tested in CI/release tests?

@richardliaw richardliaw added this to the Ray AIR milestone May 18, 2022
@clarkzinzow (Contributor, Author)

> Also I was wondering if and why the batch_format="pandas" part in some of the calls was needed.

It's not required; I was including it to make it explicit that we're applying a Pandas UDF, but the default batch format will provide Pandas batches to the UDF anyway. For the sake of maximally narrow API calls, are you voting for removing this explicit batch format?
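
For concreteness, a sketch of the equivalence (the dataset and UDF are illustrative):

```python
import ray

ds = ray.data.range_table(8)  # tiny tabular dataset for illustration

def add_square(df):  # receives a pandas.DataFrame in both calls below
    df["value_sq"] = df["value"] ** 2
    return df

# For tabular data these behave the same; the explicit batch_format
# just documents the intent.
ds.map_batches(add_square, batch_format="pandas").show()
ds.map_batches(add_square).show()
```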

> For the second example it would be great to mention what the execution model for create_shuffle_pipeline is (am I correct in assuming that the DatasetPipeline is a lazy object/expression that gets executed when iter_batches is called?).

> Ideally the second example would also use a real dataset and then e.g. train with XGBoost, but we can leave this as a follow-up item.

FYI that is an existing example (from many months ago) that I'm planning on tweaking in another PR, but I'll make a note of that feedback!

> Also, are these examples tested in CI/release tests?

I'll add the Bazel target for these example notebooks now so they run in CI; if the second example (large-scale ML ingest) fails in CI, I'll disable the target until I submit the follow-up PR tweaking that example.

@clarkzinzow clarkzinzow force-pushed the datasets/docs/examples branch 2 times, most recently from 918dd77 to b8c3b3a, on May 18, 2022 20:21
@clarkzinzow (Contributor, Author)

I guess we don't have support for the note directive in notebooks?

@clarkzinzow clarkzinzow force-pushed the datasets/docs/examples branch 2 times, most recently from 8951956 to 2b7f66b, on May 19, 2022 00:23
@clarkzinzow (Contributor, Author)

@ericl @jianoaix @pcmoritz Ping for final review; I've addressed all of the feedback above, PTAL: https://ray--24874.org.readthedocs.build/en/24874/data/examples/nyc_taxi_basic_processing.html

@clarkzinzow (Contributor, Author)

@ericl @maxpumperla @jjyao @scv119 Need a code-owner review here.

@clarkzinzow (Contributor, Author)

@ericl @maxpumperla @jjyao @scv119 Ping again.

@maxpumperla (Contributor) left a comment


looks really solid to me

@clarkzinzow clarkzinzow merged commit 6c0a457 into ray-project:master May 19, 2022
@jianoaix jianoaix modified the milestones: Ray AIR, Datasets GA May 19, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request May 20, 2022
[Datasets] Add basic e2e Datasets example on NYC taxi dataset (ray-project#24874)

This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset.

The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and dummy batch inference, for tabular (Parquet) data.
avnishn pushed a commit that referenced this pull request May 23, 2022
…A. (#25010)

* [Datasets] Add `from_huggingface` for Hugging Face datasets integration (#24464)

Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.
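
A minimal sketch of the conversion (the data file is illustrative):

```python
import ray.data
from datasets import load_dataset  # Hugging Face datasets library

# Load any Hugging Face dataset, then convert; the conversion is
# straightforward since both sides are Arrow-backed.
hf_ds = load_dataset("csv", data_files="example.csv")["train"]
ds = ray.data.from_huggingface(hf_ds)
```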

* Test the CSV read with column types specified (#24398)

Make sure users can read CSVs with column types specified.
Users may want to do this because PyArrow's type inference sometimes doesn't work as intended, in which case they can step in and work around the inference.
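
A sketch of what that looks like (the column names are illustrative); read_csv forwards these options to PyArrow:

```python
import pyarrow as pa
from pyarrow import csv
import ray

# Override PyArrow's inference for specific columns.
ds = ray.data.read_csv(
    "example.csv",
    convert_options=csv.ConvertOptions(
        column_types={"id": pa.string(), "amount": pa.float64()}
    ),
)
```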

* [Datasets] [Docs] Add a warning about from_huggingface (#24608)

Adds a warning to docs about the intended use of from_huggingface.

* [data] Expose `drop_last` in `to_tf` (#24666)

* [data] More informative exceptions in block impl (#24665)

* Add a classic yet small-sized ML dataset for demo/documentation/testing (#24592)

To facilitate easy demo/documentation/testing with realistic, small-sized, yet ML-familiar data. Having it as a source file with code makes it self-contained, i.e. after users `pip install` Ray, they are all set to run it.

IRIS is a great fit: super classic ML dataset, simple schema, only 150 rows.

* [Datasets] Add more example data. (#24795)

This PR adds more example data for ongoing feature guide work. In addition to adding the new datasets, this also puts all example data under examples/data in order to separate it from the example code.

* [Datasets] Add example protocol for reading canned in-package example data. (#24800)

Providing easy-access datasets is table stakes for a good Getting Started UX, but even with good in-package data, it can be difficult to make these paths accessible to the user. This PR adds an "example://" protocol that will resolve passed paths directly to our canned in-package example data.
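
A sketch of the protocol in use (the file name is assumed from the in-package example data):

```python
import ray

# "example://" resolves to the canned data shipped inside the ray
# package, so this runs right after `pip install ray`.
ds = ray.data.read_csv("example://iris.csv")
```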

* [minor] Use np.searchsorted to speed up random access dataset (#24825)

* [Datasets] Change `range_arrow()` API to `range_table()` (#24704)

This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.

* [Datasets] Support tensor columns in `to_tf` and `to_torch`. (#24752)

This PR adds support for tensor columns in the to_tf() and to_torch() APIs.

For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.

For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.

In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.

* Implement random_sample() (#24492)

* Map progress bar title; pretty repr for rows. (#24672)

* [Datasets] [CI] fix CI of dataset test (#24883)

CI test is broken by f61caa3. This PR fixes it.

* [Datasets] Add explicit resource allocation option via a top-level scheduling strategy (#24438)

Instead of letting Datasets implicitly use cluster resources in the margins of explicit allocations of other libraries, such as Tune, Datasets should provide an option for explicitly allocating resources for a Datasets workload for users that want to box Datasets in. This PR adds such an explicit resource allocation option, via exposing a top-level scheduling strategy on the DatasetContext with which a placement group can be given.
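
A sketch of what that option might look like in use (assuming the DatasetContext field this PR describes, plus the standard placement-group APIs):

```python
import ray
from ray.data.context import DatasetContext
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# Reserve resources for the Datasets workload in a placement group.
pg = ray.util.placement_group([{"CPU": 2}] * 2)
ray.get(pg.ready())

# Point Datasets tasks at the reservation via the top-level strategy.
ctx = DatasetContext.get_current()
ctx.scheduling_strategy = PlacementGroupSchedulingStrategy(pg)
```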

* [Datasets] Add example of using `map_batches` to filter (#24202)

The documentation says 

> Consider using .map_batches() for better performance (you can implement filter by dropping records).

but there aren't any examples of how to do so.
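
For instance, a sketch of filtering via map_batches (the dataset and predicate are illustrative):

```python
import ray

ds = ray.data.range_table(100)

# Implement filter by returning a smaller batch: drop the rows that
# fail the predicate.
evens = ds.map_batches(lambda df: df[df["value"] % 2 == 0], batch_format="pandas")
```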

* [doc] Add docs for push-based shuffle in Datasets (#24486)

Adds recommendations, example, and brief benchmark results for push-based shuffle in Datasets.

* [Doc][Data] fix big-data-ingestion broken links (#24631)

The links were broken. Fixed them.

* [docs] Fix import error in Ray Data "getting started" (#24424)

We did `import pandas as pd`, but here we were using it as `pandas`.

* [Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)

This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. In order to keep the page from getting too long and keeping it easy to navigate, tabbed views are used heavily.

* [Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346)

* Revamp the Getting Started page for Dataset (#24860)

This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.

What are the changes:
- Focus on basic/core features that are bread-and-butter for users, leave the advanced features out
- Focus on high level introduction, leave the detailed spec out (e.g. the possible batch types for the map_batches() API)
- Use more realistic (yet still simple) data example that's familiar to people (IRIS dataset in this case)
- Use the same data example throughout to make it context-switch free
- Use runnable code rather than faked
- Reference to the code from doc, instead of inlining them in the doc

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Co-authored-by: Eric Liang <ekhliang@gmail.com>

* [Datasets] Miscellaneous GA docs P0s. (#24891)

This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:

- Documents Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode, and expands our execution model docs in general.

* [docs] After careful consideration, choose the lesser of two evils and set white-space: pre-wrap #24873

* [Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#24812)

This PR makes several improvements to the Datasets' tensor story. See the issues for each item for more details.

- Automatically infer tensor blocks (single-column tables representing a single tensor) when returning NumPy ndarrays from map_batches(), map(), and flat_map().
- Automatically infer tensor columns when building tabular blocks in general.
- Fixes shuffling and sorting for tensor columns

This should improve the UX/efficiency of the following:

- Working with pure-tensor datasets in general.
- Mapping tensor UDFs over pure-tensor datasets, a better foundation for tensor-native preprocessing for end-users and AIR.
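
A sketch of the automatic ndarray-to-tensor-block inference (shapes and values are illustrative):

```python
import numpy as np
import ray

ds = ray.data.range(8)

# Returning an ndarray from the UDF now yields a tensor dataset (a
# single-column table holding the tensor) automatically.
tensor_ds = ds.map_batches(lambda batch: np.ones((len(batch), 2, 2)))
```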

* [Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)

This PR overhauls the "Accessing Datasets", adding proper coverage of each data consuming methods, including the ML framework exchange APIs (to_torch() and to_tf()).

* [Datasets] Add FAQ to Datasets docs. (#24932)

This PR adds a FAQ to Datasets docs.

Docs preview: https://ray--24932.org.readthedocs.build/en/24932/


Co-authored-by: Eric Liang <ekhliang@gmail.com>

* [Datasets] Add basic e2e Datasets example on NYC taxi dataset (#24874)

This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset.

The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and dummy batch inference, for tabular (Parquet) data.

* Revamp the Datasets API docstrings (#24949)

* Revamp the Saving Datasets user guide (#24987)

* Fix AIR references in Datasets FAQ.

* [Datasets] Skip flaky pipelining memory release test (#25009)

This pipelining memory release test is flaky; it was skipped in this Polars PR, which was then reverted.

* Note that explicit resource allocation is experimental, fix typos (#25038)

* fix the notebook test failure

* no-op indent fix

* fix notebooks test #2

* Revamp the Transforming Datasets user guide (#25033)

* Fix range_arrow(), which is replaced by range_table() (#25036)

* indent

* allow empty

* Proofread some of the datasets docs (#25068)

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>

* [Data] Add partitioning classes to Data API reference (#24203)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Robert <xiurobert@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Zhe Zhang <zhz@anyscale.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>