Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve order if necessary when deduping categoricals internally #11597

Conversation

brandon-b-miller
Copy link
Contributor

Closes #11486

@codecov
Copy link

codecov bot commented Aug 25, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@8ad0290). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11597   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    22998           
  Branches                ?        0           
===============================================
  Hits                    ?    19874           
  Misses                  ?     3124           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@brandon-b-miller brandon-b-miller marked this pull request as ready for review August 26, 2022 14:44
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner August 26, 2022 14:44
@brandon-b-miller brandon-b-miller added bug Something isn't working non-breaking Non-breaking change labels Aug 26, 2022
@brandon-b-miller
Copy link
Contributor Author

rerun tests

.drop_duplicates(ignore_index=True)
._column
)
new_cats = dedup_preserve_order(cudf.Series(new_cats))._column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have dedup_preserve_order operate on a Column rather than a Series? As much as possible, we want to avoid this pattern of constructing a Series out of a column, invoking a Series operation on it, and then getting the column back out of the result.

Having it operate on a column would have the added benefit that we can then make it a method of CategoricalColumn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina those are all good points, I'll update as such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to ColumnBase in the end as I ended up not being able to see any reason all columns shouldn't share the capability. Let me know if this should be changed.

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -217,6 +217,16 @@ def dropna(self, drop_nan: bool = False) -> ColumnBase:
# The drop_nan argument is only used for numerical columns.
return drop_nulls([self])[0]

def _dedup_preserve_order(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be merged with the unique() method? I'm thinking col.unique() returns unique values in arbitrary order, and col.unique(preserve_order=True) does this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this offline and came to the conclusion that long term we should implement #11638 to back this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still collapse this _dedup_preserve_order() method into unique()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@shwina shwina self-requested a review August 30, 2022 16:54
@brandon-b-miller
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 488c7ad into rapidsai:branch-22.10 Sep 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Handling of categoricals assumes drop_duplicates preserves order (it doesn't)
3 participants