
Add support for dask dataframes #99

Merged: 7 commits merged into ray-project:master on May 14, 2021

Conversation

krfricke (Collaborator)

Closes #92

Note that the locality-aware scheduling cannot currently be tested reliably, for two reasons:

  1. We cannot guarantee that dask partitions end up living on a specific node (to set an initial state for redistribution)
  2. We cannot guarantee that tasks that determine the dask partition node are co-scheduled with the respective partition.

Once the ray.state.objects() API (or something similar) comes back, the second point should be easy to address, and the first will be easy to confirm.

@krfricke (Collaborator, Author)

Currently it seems that the problem lies with 1): ray memory shows that each partition lives on the head node after calling persist(). I'll try to find a solution to that.
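
For context, here is a minimal sketch of the kind of setup being discussed, assuming the Dask-on-Ray scheduler (ray.util.dask.ray_dask_get) and a toy dataframe; the PR's actual loader code may differ:

    import dask
    import dask.dataframe as dd
    import pandas as pd
    import ray
    from ray.util.dask import ray_dask_get

    ray.init()

    # Run Dask tasks on Ray so that persisted partitions become Ray objects.
    dask.config.set(scheduler=ray_dask_get)

    pdf = pd.DataFrame({"a": range(1000), "b": range(1000)})
    ddf = dd.from_pandas(pdf, npartitions=8)

    # persist() materializes the partitions eagerly. The problem described
    # above is that all of them may end up on the head node instead of
    # being spread across the cluster.
    ddf = ddf.persist()

    # Partition placement can then be inspected from the CLI with `ray memory`.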

@richardliaw requested a review from amogkam on May 13, 2021

# Dask does not support iloc() for row selection, so we have to
# compute a local pandas dataframe first
local_df = data.compute()
Collaborator

will this materialize the full distributed dataframe?

Collaborator (Author)

Yes. However, this part of the code concerns local loading, i.e. loading of either a single partition or centralized loading, where access to the whole dataframe is assumed anyway. This is the part where we end up with a pandas dataframe.
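
To illustrate the point about compute() (a simplified example, not the PR's actual loading code): calling it pulls every partition into the local process and returns a plain pandas dataframe, after which iloc-style row selection works:

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({"x": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # compute() materializes the full distributed dataframe locally and
    # returns a pandas DataFrame.
    local_df = ddf.compute()

    # pandas supports positional row selection, which dask.dataframe lacks.
    rows = local_df.iloc[[0, 5, 10]]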

Comment on lines +150 to +151
# Pass tuples here (integers can be misinterpreted as row numbers)
ip_to_parts[ip].append((pid, ))
Collaborator

hmm, why would it be misinterpreted?

Collaborator (Author)

In load_data(), we are doing this check:


        if indices is not None and len(indices) > 0 and isinstance(
                indices[0], Tuple):

Usually, numerical indices indicate the row numbers a worker should load. This is also valid for Dask (e.g. for centralized loading). However, in this case the numbers should indicate partition numbers. To avoid confusing these with row indices, we use a different type. IMO tuples are a good choice because they are immutable and add very little overhead.

Note that in Modin we use Ray ObjectIDs instead, but we don't have these for Dask at this point in the code.
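
A rough sketch of the disambiguation described above; load_data and its indices argument are simplified stand-ins for the actual implementation:

    from typing import List, Optional, Tuple, Union

    def load_data(
            indices: Optional[Union[List[int], List[Tuple[int]]]] = None):
        # Tuples mark dask partition IDs; plain integers mark row numbers.
        if indices is not None and len(indices) > 0 and isinstance(
                indices[0], tuple):
            partition_ids = [pid for (pid, ) in indices]
            print(f"Loading dask partitions: {partition_ids}")
        elif indices is not None:
            print(f"Loading rows: {indices}")
        else:
            print("Loading the full dataframe")

    load_data([0, 1, 2])        # interpreted as row numbers
    load_data([(0, ), (3, )])   # interpreted as partition IDs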

@richardliaw (Collaborator) left a comment

Generally looks great! The main question I have is about some of the implementation details.

@krfricke merged commit 888cdd1 into ray-project:master on May 14, 2021
@krfricke deleted the dask-df branch on May 14, 2021 16:09

Linked issue this pull request may close: Support Distributed Dask dataframes (#92)