Add support for dask dataframes #99
Conversation
Currently it seems that the problem lies with 1).

```python
# Dask does not support iloc() for row selection, so we have to
# compute a local pandas dataframe first
local_df = data.compute()
```
will this materialize the full distributed dataframe?
Yes. However, this part of the code concerns local loading, i.e. loading of either a single partition or centralized loading, where access to the whole dataframe is assumed anyway. This is the part where we end up with a pandas dataframe.
```python
# Pass tuples here (integers can be misinterpreted as row numbers)
ip_to_parts[ip].append((pid, ))
```
hmm, why would it be misinterpreted?
In `load_data()`, we are doing this check:

```python
if indices is not None and len(indices) > 0 and isinstance(
        indices[0], Tuple):
```
Usually, numerical indices indicate the row numbers a worker should load. This is also valid for Dask (e.g. for centralized loading). However, in this case the numbers should indicate partition numbers. To avoid confusing these with row indices, we use a different type. IMO tuples are a good choice because they are immutable and introduce very little overhead.
Note that in Modin we use Ray ObjectIDs instead, but we don't have these for Dask at this point in the code.
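The dispatch described above can be sketched like this. `classify_indices` is a hypothetical stand-in for the check in `load_data()`, not the actual implementation, and it uses the plain `tuple` built-in for the `isinstance` check:

```python
def classify_indices(indices):
    # Hypothetical stand-in for the check in load_data():
    # a leading tuple marks partition ids, plain ints mark row numbers.
    if indices is not None and len(indices) > 0 and isinstance(
            indices[0], tuple):
        return "partitions", [pid for (pid,) in indices]
    return "rows", list(indices or [])

print(classify_indices([(0,), (2,)]))  # ('partitions', [0, 2])
print(classify_indices([0, 1, 2]))     # ('rows', [0, 1, 2])
print(classify_indices(None))          # ('rows', [])
```

Because the two kinds of index lists are type-distinguishable, the same argument can carry either meaning without an extra flag.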
Generally looks great! The main question I have is about some of the implementation details.
Closes #92
Note that the locality-aware scheduling cannot currently be tested reliably. This is due to two reasons:
Once the `ray.state.objects()` API or something similar comes back, the second option should be easy to enable, and the first option will be easy to confirm.