Add sort dataframe logic on qid #239
I wonder if it wouldn't work better to include this logic in |
@Yard1 I don't think we can sort cleanly in |
Moved the logic in |
Can we also add a test that would fail without sorted qids? The existing one can be modified to add a case like that. |
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: atomic <atomic@users.noreply.github.com>
xgboost_ray/matrix.py
Outdated
@@ -227,6 +227,10 @@ def _split_dataframe(
            `label_upper_bound`

        """
        # sort dataframe by qid if exists (required by DMatrix)
        if self.qid and not local_data[self.qid].is_monotonic:
            local_data = local_data.sort_values([self.qid])
This is awesome! So this should be done at each worker: after each parquet file is loaded into a pandas dataframe and concatenated into local_data, you sort it by qid? After that you use it to create the DMatrix and do a ray.put to the object store?
@atomic can you also add a few unit test cases that cover our use case, e.g. qids spread across multiple parquet files? As long as we make sure per-worker sorting works, this will work?
Yes, this sorting is done after the dataframes loaded for the actor's shards are concatenated into one. And yes, the sorted data is then put into the Ray object store.
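The per-worker step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `concat_and_sort_by_qid` and the sample shards are assumptions, and the Ray object store step is left out. It uses `is_monotonic_increasing` because the `is_monotonic` alias shown in the diff was removed in pandas 2.0, and a stable sort so rows within one query keep their relative order.

```python
import pandas as pd


def concat_and_sort_by_qid(shards, qid_col):
    """Hypothetical helper: merge per-actor shards, then sort rows by qid if needed."""
    local_data = pd.concat(shards, ignore_index=True)
    # Only pay for the sort when the qid column is not already non-decreasing.
    if not local_data[qid_col].is_monotonic_increasing:
        # mergesort is stable, so rows sharing a qid keep their original order.
        local_data = local_data.sort_values(qid_col, kind="mergesort")
    return local_data


# Two shards (e.g. two parquet files) whose qids interleave.
shard_a = pd.DataFrame({"group": [2, 0, 1], "x": [0.1, 0.2, 0.3]})
shard_b = pd.DataFrame({"group": [1, 0, 2], "x": [0.4, 0.5, 0.6]})
merged = concat_and_sort_by_qid([shard_a, shard_b], "group")
print(merged["group"].tolist())
```

Sorting after concatenation matters here because each shard can be sorted on its own while the combined frame is not.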
@heyitsmui added integ tests for multi-parquet files
xgboost_ray/matrix.py
Outdated
        elif isinstance(qid, pd.DataFrame):
            _qid = qid.iloc[:, 0]
can we raise an exception if it has more than 1 column?
sure
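One way the requested check could look is sketched below. The function name `qid_to_series` and the error messages are hypothetical and not taken from the PR; the sketch only illustrates rejecting a multi-column dataframe instead of silently taking its first column.

```python
import numpy as np
import pandas as pd


def qid_to_series(qid):
    """Hypothetical helper: coerce a qid input to a one-dimensional pandas Series."""
    if isinstance(qid, pd.Series):
        return qid
    if isinstance(qid, pd.DataFrame):
        # Refuse ambiguous input rather than guessing which column holds the qid.
        if qid.shape[1] != 1:
            raise ValueError(
                f"qid must have exactly one column, got {qid.shape[1]}")
        return qid.iloc[:, 0]
    arr = np.asarray(qid)
    if arr.ndim != 1:
        raise ValueError(f"qid must be one-dimensional, got {arr.ndim} dimensions")
    return pd.Series(arr)


single = qid_to_series(pd.DataFrame({"group": [3, 1, 2]}))
print(single.tolist())
```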
- add logic to handle more cases of qid data type (array, dataframe)
- add 2 integration tests to cover the qid sorting behavior
Hey @atomic this is great! 2 small nits regarding tests. Can you also run the format.sh script in the root repo folder to make lint CI pass? Thanks!
xgboost_ray/tests/test_matrix.py
Outdated
        _ = DMatrix(**{
            "data": in_x,
            "label": in_y,
            "qid": unsorted_qid
        })
        _ = DMatrix(**{
            "data": in_x,
            "label": in_y,
            "qid": np.sort(unsorted_qid)
        })  # no exception
        # test RayDMatrix handles sorting automatically
        mat = RayDMatrix(in_x, in_y, qid=unsorted_qid)
        params = mat.get_data(rank=0, num_actors=1)
        _ = DMatrix(**params)
Suggested change:
-        _ = DMatrix(**{
-            "data": in_x,
-            "label": in_y,
-            "qid": unsorted_qid
-        })
-        _ = DMatrix(**{
-            "data": in_x,
-            "label": in_y,
-            "qid": np.sort(unsorted_qid)
-        })  # no exception
-        # test RayDMatrix handles sorting automatically
-        mat = RayDMatrix(in_x, in_y, qid=unsorted_qid)
-        params = mat.get_data(rank=0, num_actors=1)
-        _ = DMatrix(**params)
+        DMatrix(**{
+            "data": in_x,
+            "label": in_y,
+            "qid": unsorted_qid
+        })
+        DMatrix(**{
+            "data": in_x,
+            "label": in_y,
+            "qid": np.sort(unsorted_qid)
+        })  # no exception
+        # test RayDMatrix handles sorting automatically
+        mat = RayDMatrix(in_x, in_y, qid=unsorted_qid)
+        params = mat.get_data(rank=0, num_actors=1)
+        DMatrix(**params)
xgboost_ray/tests/test_matrix.py
Outdated
            label="label",
            qid="group")
        params = mat.get_data(rank=0, num_actors=1)
        _ = DMatrix(**params)
Suggested change:
-        _ = DMatrix(**params)
+        DMatrix(**params)
@atomic |
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Cutting edge test failure is the same as on master |
If qid is provided, xgb.DMatrix requires data to be sorted by qid - see data.cc:
https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/matrix.py#L351
Draft to add automatic sorting of the dataframe when qid is given and the dataframe is not already sorted by qid.
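For the array case, the same idea can be sketched with NumPy alone. This is an illustration under assumptions, not the PR's implementation: the helper name `sort_arrays_by_qid` is hypothetical, and the key point is that features, labels, and qid must all be reordered with one shared permutation so rows stay aligned.

```python
import numpy as np


def sort_arrays_by_qid(data, label, qid):
    """Hypothetical helper: reorder data, label and qid so qid is non-decreasing."""
    data, label, qid = np.asarray(data), np.asarray(label), np.asarray(qid)
    # Already sorted: return the inputs unchanged and skip the sort.
    if np.all(qid[:-1] <= qid[1:]):
        return data, label, qid
    # A stable argsort keeps the original order of rows within each query.
    order = np.argsort(qid, kind="stable")
    return data[order], label[order], qid[order]


in_x = np.arange(8).reshape(4, 2)
in_y = np.array([0, 1, 0, 1])
unsorted_qid = np.array([1, 0, 1, 0])
sorted_x, sorted_y, sorted_qid = sort_arrays_by_qid(in_x, in_y, unsorted_qid)
print(sorted_qid.tolist())
```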