Fix index difference to follow the pandas format #14789

amiralimi · 2024-01-19T00:14:43Z

Description

This PR fixes a bug in Index.difference: the function kept duplicate elements, whereas pandas removes them. The existing tests had no inputs with duplicates, so I added new tests, including the case from the original issue.

closes [BUG] Index.difference does not uniquify output for duplicate indexes #14489
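
For reference, the target behavior can be modeled in plain Python. This is only a minimal sketch of the expected semantics, not cuDF or pandas code: `index_difference` is a hypothetical helper name, and the sorting step assumes the default pandas behavior of sorting the result when the values are comparable.

```python
def index_difference(left, right):
    """Model of the expected result: unique values of `left` not in `right`, sorted."""
    excluded = set(right)
    # The set comprehension drops duplicates from `left` as well as
    # any value that also appears in `right`.
    return sorted({value for value in left if value not in excluded})


print(index_difference([1, 1, 2, 3, 3], [3]))  # [1, 2]
```

Before this change, the duplicated `1` in an input like the one above would have survived in the cuDF output.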

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-01-19T00:14:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vyasr · 2024-01-19T21:35:37Z

/ok to test

vyasr · 2024-01-19T21:36:11Z

@wence- could you have a look at this when you get a chance? Thanks!

shwina · 2024-01-23T19:20:40Z

python/cudf/cudf/core/_base_index.py

        else:
            other = other.copy(deep=False)
            difference = cudf.core.index._index_from_data(
-                cudf.DataFrame._from_data({"None": self._column})
+                cudf.DataFrame._from_data({"None": self._column.unique()})
                .merge(
                    cudf.DataFrame._from_data({"None": other._column}),


Should we also call unique() on the right hand side to make the merge smaller?

I tried this with a few test cases; in some it performed better, and in others it was worse. Pandas does this, though.

OK, I'd just add a note here to alert future readers about potential performance issues:

# NOTE: may need to investigate calling `unique()` on the RHS before the merge for better performance

shwina · 2024-01-23T20:08:57Z

@amiralimi - this is looking good. Could you run the style check, resolve any style issues found, and push a new commit with the updated style? For this, using pre-commit is helpful:

pre-commit install
pre-commit run --all

amiralimi · 2024-01-23T21:33:20Z

@shwina Thanks for helping out. I just ran pre-commit. This is my first time contributing to an open-source project. Is there anything I need to do?

shwina · 2024-01-23T21:38:21Z

/ok to test

shwina · 2024-01-23T21:46:05Z

/ok to test

shwina · 2024-01-24T18:45:36Z

/ok to test

wence-

Thanks, this looks good to me! It would be nice to see the small performance tests you ran to see if uniquifying the right column was helpful.

wence- · 2024-01-25T10:32:22Z

python/cudf/cudf/core/_base_index.py

        else:
            other = other.copy(deep=False)
            difference = cudf.core.index._index_from_data(
-                cudf.DataFrame._from_data({"None": self._column})
+                cudf.DataFrame._from_data({"None": self._column.unique()})
                .merge(
                    cudf.DataFrame._from_data({"None": other._column}),


I am somewhat surprised that calling unique on the right column sometimes improves performance. unique calls stable_distinct which builds a hash table to uniquify things.

The leftanti join builds a hash table for the right column and then probes that hash table with the left column to return those rows in the left column that are not in the hash table.

So calling unique() on the right column would just seem to be an extra hash-table build for no gain.

Can you show the test cases you ran to check performance @amiralimi ?
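
The reasoning above can be illustrated with a toy model. This is a sketch under stated assumptions, not the libcudf implementation: Python's `set` and `dict` stand in for the hash tables, and both function names are hypothetical.

```python
def left_anti_join(left, right):
    """Return the values of `left` that do not appear in `right`."""
    table = set(right)  # build phase: one hash table over the right column
    return [value for value in left if value not in table]  # probe phase


def left_anti_join_unique_rhs(left, right):
    """Same join, but uniquify `right` first, as discussed in this thread."""
    unique_right = list(dict.fromkeys(right))  # extra hash-based pass (stable distinct)
    return left_anti_join(left, unique_right)  # the join still builds its own table


left = [1, 2, 3, 4]  # already uniquified, as in this PR
right = [3, 3, 4, 4, 4, 5]
assert left_anti_join(left, right) == left_anti_join_unique_rhs(left, right) == [1, 2]
```

In this model both variants return the same rows, and the second one simply performs an additional hash-based pass over the right column before the join builds its table.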

amiralimi · 2024-01-25T16:28:33Z

Hi @wence-.
You are correct. I just reran the tests, and calling unique() on the RHS was slower in every case.
This is the test I wrote:

import cudf
import cupy
import time

l1 = cudf.Index(cupy.random.randint(0, 100, 10000000))
r1 = cudf.Index(cupy.random.randint(0, 100, 1000000))

r2 = cudf.Index(cupy.random.randint(0, 1000000, 1000000))
l2 = cudf.Index(cupy.random.randint(0, 1000, 10000000))

l3 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))
r3 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))

l4 = cudf.Index(cupy.random.randint(0, 1000, 10000000))
r4 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))

start = time.time()
l1.difference(r1)
end = time.time()
print(f"test 1: {end - start}")

start = time.time()
l2.difference(r2)
end = time.time()
print(f"test 2: {end - start}")

start = time.time()
l3.difference(r3)
end = time.time()
print(f"test 3: {end - start}")

start = time.time()
l4.difference(r4)
end = time.time()
print(f"test 4: {end - start}")

and this is the output:

// no RHS unique
test 1: 0.04081010818481445
test 2: 0.044335126876831055
test 3: 0.04413151741027832
test 4: 0.01869058609008789

// RHS unique
test 1: 0.04834389686584473
test 2: 0.057317495346069336
test 3: 0.06256484985351562
test 4: 0.03918814659118652

shwina · 2024-01-25T16:40:00Z

/merge

shwina · 2024-01-25T16:40:45Z

Thanks @amiralimi - we're honored to have you contribute to cuDF as your first open-source contribution! Hope we'll see more!

amiralimi · 2024-01-25T17:48:42Z

Thanks @shwina. I really enjoyed working on cuDF and will try to contribute more to it.

wence- · 2024-01-26T16:18:48Z

Thanks!

amiralimi added 2 commits January 18, 2024 17:54

added the tests from the issue on GitHub.

5886152

added .unique() to data so the output is a set.

50bac72

amiralimi requested a review from a team as a code owner January 19, 2024 00:14

amiralimi requested review from vyasr and shwina January 19, 2024 00:14

github-actions bot added the "Python" (Affects Python cuDF API.) label Jan 19, 2024

vyasr assigned amiralimi Jan 19, 2024

vyasr added the "bug" (Something isn't working) and "non-breaking" (Non-breaking change) labels Jan 19, 2024

shwina reviewed Jan 23, 2024

executed pre-commit run --all

ea6ff8b

Whitespace cleanup

e3f1a87

Merge branch 'branch-24.02' of github.com:rapidsai/cudf into fix-inde…

27f8ef9

…x-difference

wence- approved these changes Jan 25, 2024

rapids-bot bot merged commit 0cd58fb into rapidsai:branch-24.02 Jan 25, 2024
68 checks passed

amiralimi deleted the fix-index-difference branch January 30, 2024 18:20
