Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sort dataframe logic on qid #239

Merged
merged 12 commits into from
Oct 6, 2022
Merged

Add sort dataframe logic on qid #239

merged 12 commits into from
Oct 6, 2022

Conversation

atomic
Copy link
Contributor

@atomic atomic commented Oct 3, 2022

If qid is provided, xgb.DMatrix requires data to be sorted by qid - see data.cc:
https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/matrix.py#L351

draft to add auto sorting of dataframe if qid is given and dataframe is not already sorted by qid

@atomic atomic changed the title add sort dataframe logic on qid for centralized loader (draft) add sort dataframe logic on qid (draft) Oct 3, 2022
@Yard1
Copy link
Member

Yard1 commented Oct 3, 2022

I wonder if it wouldn't work better to include this logic in _get_dmatrix function in main.py. X and y there are pandas objects, so we can sort X and then apply the index to y. That way we support all cases as the sorting would happen on each actor right before a DMatrix is created.

@atomic
Copy link
Contributor Author

atomic commented Oct 3, 2022

@Yard1 I don't think we can sort cleanly in _get_dmatrix since the columns are already selected out of the dataframe to list.

@atomic
Copy link
Contributor Author

atomic commented Oct 3, 2022

Moved the logic in _split_dataframe in order to apply it to _DistributedRayDMatrixLoader as well.

@Yard1
Copy link
Member

Yard1 commented Oct 3, 2022

_split_dataframe works for me! Can we encapsulate the logic in a separate method for better modularity and call it in _split_dataframe? Thanks, let's see if CI is happy :)

Can we also add a test that would fail without sorted qids? The existing one can be modified to add a case like that.

xgboost_ray/matrix.py Outdated Show resolved Hide resolved
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: atomic <atomic@users.noreply.github.com>
@@ -227,6 +227,10 @@ def _split_dataframe(
`label_upper_bound`

"""
# sort dataframe by qid if exists (required by DMatrix)
if self.qid and not local_data[self.qid].is_monotonic:
local_data = local_data.sort_values([self.qid])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome! so this should be done at each worker, after each parquet is loaded into pdarray and concatenated into local_data and then you sort it by qid? After that you will use it to create dmatrix and do a ray.put to object store?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atomic can you also add a few unit test cases that cover our use case e.g. qids are across multiple parquet files and as long as we make sure per-worker level sorting works this will work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this sorting is done after ray concat all the actor shard's loaded dataframe into one. And yes, the sorted data is then put to ray object store.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heyitsmui added integ tests for multi-parquet files

Comment on lines 57 to 58
elif isinstance(qid, pd.DataFrame):
_qid = qid.iloc[:, 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we raise an exception if it has more than 1 column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Tony added 2 commits October 4, 2022 22:49
- add logic to include more case of qid data type (array, dataframe)
- add 2 integration tests to cover behavior for sorting qid
@atomic atomic changed the title add sort dataframe logic on qid (draft) add sort dataframe logic on qid Oct 5, 2022
@atomic atomic marked this pull request as ready for review October 5, 2022 05:59
Copy link
Member

@Yard1 Yard1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @atomic this is great! 2 small nits regarding tests. Can you also run format.sh script in the root repo folder to make lint CI pass? Thanks!

Comment on lines 367 to 380
_ = DMatrix(**{
"data": in_x,
"label": in_y,
"qid": unsorted_qid
})
_ = DMatrix(**{
"data": in_x,
"label": in_y,
"qid": np.sort(unsorted_qid)
}) # no exception
# test RayDMatrix handles sorting automatically
mat = RayDMatrix(in_x, in_y, qid=unsorted_qid)
params = mat.get_data(rank=0, num_actors=1)
_ = DMatrix(**params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ = DMatrix(**{
"data": in_x,
"label": in_y,
"qid": unsorted_qid
})
_ = DMatrix(**{
"data": in_x,
"label": in_y,
"qid": np.sort(unsorted_qid)
}) # no exception
# test RayDMatrix handles sorting automatically
mat = RayDMatrix(in_x, in_y, qid=unsorted_qid)
params = mat.get_data(rank=0, num_actors=1)
_ = DMatrix(**params)
DMatrix(**{
"data": in_x,
"label": in_y,
"qid": unsorted_qid
})
DMatrix(**{
"data": in_x,
"label": in_y,
"qid": np.sort(unsorted_qid)
}) # no exception
# test RayDMatrix handles sorting automatically
mat = RayDMatrix(in_x, in_y, qid=unsorted_qid)
params = mat.get_data(rank=0, num_actors=1)
DMatrix(**params)

label="label",
qid="group")
params = mat.get_data(rank=0, num_actors=1)
_ = DMatrix(**params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ = DMatrix(**params)
DMatrix(**params)

@Yard1 Yard1 changed the title add sort dataframe logic on qid Add sort dataframe logic on qid Oct 5, 2022
@Yard1
Copy link
Member

Yard1 commented Oct 5, 2022

@atomic xgboost_ray/matrix.py:61:80: E501 line too long (99 > 79 characters) (just this left!)

xgboost_ray/tests/test_matrix.py Outdated Show resolved Hide resolved
xgboost_ray/tests/test_matrix.py Outdated Show resolved Hide resolved
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@Yard1
Copy link
Member

Yard1 commented Oct 6, 2022

Cutting edge test failure is the same as on master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants