Use limited API-compatible APIs in arr #600

Merged
rapids-bot[bot] merged 4 commits into rapidsai:main from vyasr:limited_api_testing
Mar 3, 2026

Conversation

@vyasr
Contributor

@vyasr vyasr commented Feb 28, 2026

This PR isolates the changes from #564 that restrict the code to symbols that are part of the limited API. The second commit includes the test changes that I used to benchmark, which should be reverted.

I ran the send_recv benchmark (ucxx-async backend, TAG transfer API, NumPy objects, 1 B payload, 100,000 iterations, TCP transport) across 8 configurations:

  • two progress modes: blocking, polling
  • two code versions: original (without bddf9d6) and limited API (with bddf9d6)
  • two array dimensionalities: linear (without 5904f16) and 3D* (with 5904f16)

I was looking to see if the limited API changes meaningfully affected performance, particularly in the 3D case since the changes were around the array's allocation of shapes, which are more relevant for higher dimensional arrays. What I found was that there are no meaningful performance differences across any of the 8 configurations. Bandwidth and latency numbers are consistent within normal run-to-run variance across all combinations of progress mode, code version, and array dimensionality. In particular, the limited API changes show no measurable overhead on either linear or multi-dimensional arrays. All the differences seem to be within the margin of noise.

| Code version | Progress mode | Array dim | Bandwidth (avg) | Bandwidth (med) | Latency (avg) | Latency (med) |
|---|---|---|---|---|---|---|
| original | blocking | linear | 28.19 kiB/s | 28.52 kiB/s | 34,643 ns | 34,238 ns |
| limited API | blocking | linear | 30.72 kiB/s | 32.03 kiB/s | 31,793 ns | 30,490 ns |
| original | blocking | 3D | 27.39 kiB/s | 27.23 kiB/s | 35,652 ns | 35,857 ns |
| limited API | blocking | 3D | 26.79 kiB/s | 27.42 kiB/s | 36,459 ns | 35,610 ns |
| original | polling | linear | 33.27 kiB/s | 34.91 kiB/s | 29,349 ns | 27,974 ns |
| limited API | polling | linear | 32.03 kiB/s | 33.16 kiB/s | 30,492 ns | 29,452 ns |
| original | polling | 3D | 28.53 kiB/s | 28.66 kiB/s | 34,234 ns | 34,078 ns |
| limited API | polling | 3D | 28.34 kiB/s | 28.90 kiB/s | 34,461 ns | 33,793 ns |
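To put a number on "within the margin of noise", the relative change in median latency between the two code versions can be computed directly from the summary table. This is only a sketch over the eight reported medians, not a statistical test; the dictionary layout and the helper name `relative_change_pct` are illustrative and not part of the benchmark code.

```python
# Median latency in ns, copied from the summary table in the PR description.
median_latency_ns = {
    ("blocking", "linear"): {"original": 34238, "limited API": 30490},
    ("blocking", "3d"): {"original": 35857, "limited API": 35610},
    ("polling", "linear"): {"original": 27974, "limited API": 29452},
    ("polling", "3d"): {"original": 34078, "limited API": 33793},
}


def relative_change_pct(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Negative values mean the limited API build had the lower median latency.
changes = {
    config: relative_change_pct(v["original"], v["limited API"])
    for config, v in median_latency_ns.items()
}

for (mode, dim), pct in changes.items():
    print(f"{mode:8s} {dim:6s} {pct:+6.2f}%")
```

The sign of the change differs between configurations, and the two 3D cases (where the shape handling matters most) both differ by less than 1%, which fits the reading that there is no systematic overhead.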

Raw Results

blocking · original · linear array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 28.19 kiB/s
Bandwidth (median)        | 28.52 kiB/s
Latency (average)         | 34643 ns
Latency (median)          | 34238 ns

blocking · original · nd array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 27.39 kiB/s
Bandwidth (median)        | 27.23 kiB/s
Latency (average)         | 35652 ns
Latency (median)          | 35857 ns

blocking · limited API · linear array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 30.72 kiB/s
Bandwidth (median)        | 32.03 kiB/s
Latency (average)         | 31793 ns
Latency (median)          | 30490 ns

blocking · limited API · nd array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 26.79 kiB/s
Bandwidth (median)        | 27.42 kiB/s
Latency (average)         | 36459 ns
Latency (median)          | 35610 ns

polling · original · linear array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 33.27 kiB/s
Bandwidth (median)        | 34.91 kiB/s
Latency (average)         | 29349 ns
Latency (median)          | 27974 ns

polling · original · nd array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 28.53 kiB/s
Bandwidth (median)        | 28.66 kiB/s
Latency (average)         | 34234 ns
Latency (median)          | 34078 ns

polling · limited API · linear array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 32.03 kiB/s
Bandwidth (median)        | 33.16 kiB/s
Latency (average)         | 30492 ns
Latency (median)          | 29452 ns

polling · limited API · nd array
Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 28.34 kiB/s
Bandwidth (median)        | 28.90 kiB/s
Latency (average)         | 34461 ns
Latency (median)          | 33793 ns

@vyasr vyasr requested a review from a team as a code owner February 28, 2026 00:29
@vyasr vyasr self-assigned this Feb 28, 2026
@vyasr vyasr added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Feb 28, 2026
Member

@pentschev pentschev left a comment

Thanks @vyasr for working on this. I'm very much supportive of these changes if we can get the tests fixed and clarify why the changes to the benchmark were made; at first glance, all of those changes seem undesirable to me and possibly indicative of a regression in Array support. If we can have all tests passing, hopefully without requiring the changes to benchmarks, and can still maintain the performance from the description, I'm +1 on getting these changes in.

Comment on lines +68 to +72
+length = ((self.args.n_bytes + 3) // 4) * 4
+shape = (length // 4, 2, 2)
 if not self.args.enable_am:
     if self.args.reuse_alloc and self.args.n_buffers == 1:
-        reuse_msg = Array(xp.zeros(self.args.n_bytes, dtype="u1"))
+        reuse_msg = Array(xp.reshape(xp.zeros(length, dtype="u1"), shape))
Member

Why do we need this change, with all the cryptic constants and the final reshape? Note that we don't do that in other implementations like https://github.com/rapidsai/ucxx/blob/main/python/ucxx/ucxx/benchmarks/backends/ucxx_core.py, and I would very much like to avoid comparing apples to oranges in the results.
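For readers puzzling over the same constants: one reading of the snippet, consistent with the "3D" configuration in the PR description, is that `(n_bytes + 3) // 4 * 4` rounds the payload up to a multiple of 4 so that it divides evenly into a `(length // 4, 2, 2)` shape. The sketch below reproduces that arithmetic using only the standard library. It substitutes `memoryview.cast` for the benchmark's `xp.reshape`, and the helper name `padded_3d_shape` is hypothetical.

```python
def padded_3d_shape(n_bytes):
    """Round n_bytes up to a multiple of 4 and derive a (k, 2, 2) shape."""
    length = ((n_bytes + 3) // 4) * 4
    shape = (length // 4, 2, 2)
    return length, shape


n_bytes = 1  # the payload size requested in the benchmark runs
length, shape = padded_3d_shape(n_bytes)

# Stand-in for xp.reshape(xp.zeros(length, dtype="u1"), shape):
# reinterpret a flat byte buffer as a 3D, C-contiguous buffer.
buf = memoryview(bytearray(length)).cast("B", list(shape))

print(length, shape, buf.ndim, buf.nbytes)
```

Under this reading, a 1 B request is padded to a 4-byte array of shape `(1, 2, 2)`, so the 3D runs transfer slightly more data than the linear ones.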

@pentschev
Member

Sorry, I should have read the description more attentively; I now see the reason for the shape changes. If we can get the tests fixed, I'm definitely fine with the performance results, which are essentially unchanged. I would like us not to commit the 3D changes to the benchmark, or else to make them configurable, but if you'd like to make that configurable we should do it for the other backends as well, which I don't think is worth the trouble.

@vyasr
Contributor Author

vyasr commented Mar 2, 2026

The failing test is IMHO a bit too strong. It checks whether the obj under the Array is the same as the obj under the memoryview. The failure occurs because PyObject_GetBuffer behaves slightly differently from PyMemoryView_FromObject. If passed a memoryview, PyMemoryView_FromObject will pull the underlying object out of the memoryview and put it into the new memoryview, whereas PyObject_GetBuffer will always use the passed-in object as the obj member. That means that, with PyObject_GetBuffer, the test would instead pass if we did:

arr = Array(buffer)
mv = memoryview(buffer)
test_obj = arr.obj
# General solution: unwrap any chain of memoryviews
while isinstance(test_obj, memoryview):
    test_obj = test_obj.obj
# Or, more specifically, when exactly one level of wrapping is expected:
# test_obj = arr.obj.obj
# Now this will pass
assert test_obj is mv.obj
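The behavioral difference described above can be checked from pure Python. The following sketch is not part of the PR: it assumes CPython, calls `PyObject_GetBuffer` through `ctypes.pythonapi`, and mirrors the `Py_buffer` struct by hand. The `PyBuffer` class and the `buffer_exporter` helper are hypothetical names introduced for illustration.

```python
import ctypes


class PyBuffer(ctypes.Structure):
    """Hand-written mirror of CPython's Py_buffer struct (an assumption)."""

    _fields_ = [
        ("buf", ctypes.c_void_p),
        ("obj", ctypes.py_object),
        ("len", ctypes.c_ssize_t),
        ("itemsize", ctypes.c_ssize_t),
        ("readonly", ctypes.c_int),
        ("ndim", ctypes.c_int),
        ("format", ctypes.c_char_p),
        ("shape", ctypes.POINTER(ctypes.c_ssize_t)),
        ("strides", ctypes.POINTER(ctypes.c_ssize_t)),
        ("suboffsets", ctypes.POINTER(ctypes.c_ssize_t)),
        ("internal", ctypes.c_void_p),
    ]


_get_buffer = ctypes.pythonapi.PyObject_GetBuffer
_get_buffer.argtypes = [ctypes.py_object, ctypes.POINTER(PyBuffer), ctypes.c_int]
_get_buffer.restype = ctypes.c_int

_release_buffer = ctypes.pythonapi.PyBuffer_Release
_release_buffer.argtypes = [ctypes.POINTER(PyBuffer)]
_release_buffer.restype = None


def buffer_exporter(obj):
    """Return the `obj` member that PyObject_GetBuffer records for `obj`."""
    view = PyBuffer()
    _get_buffer(obj, ctypes.byref(view), 0)  # 0 == PyBUF_SIMPLE
    try:
        return view.obj
    finally:
        _release_buffer(ctypes.byref(view))


base = b"abc"
inner = memoryview(base)

# PyObject_GetBuffer reports the object it was handed (the memoryview itself).
print(buffer_exporter(inner) is inner)
# memoryview(...) of a memoryview reports the original underlying object.
print(memoryview(inner).obj is base)
# One level of unwrapping recovers the underlying object in the first case.
print(buffer_exporter(inner).obj is base)
```

This matches the observation that a single `.obj` step on `arr.obj` is enough to reach the object that `memoryview(buffer).obj` reports.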

IMO changing the test is the right answer, since we shouldn't promise that an object chain never occurs. That is, mv = memoryview(memoryview(memoryview(arr))) shouldn't be expected to have mv.obj == arr.obj; it should be OK to have to do mv.obj.obj.obj to get the original object.

However, if we feel strongly otherwise, it's straightforward enough to fix this in the code with the following diff:

❯ git diff
diff --git a/python/ucxx/ucxx/_lib/arr.pyx b/python/ucxx/ucxx/_lib/arr.pyx
index 6bf0d00..e3411a5 100644
--- a/python/ucxx/ucxx/_lib/arr.pyx
+++ b/python/ucxx/ucxx/_lib/arr.pyx
@@ -142,6 +142,8 @@ cdef class Array:

                 self.ptr = <uintptr_t>pybuf.buf
                 self.obj = pybuf.obj
+                while isinstance(self.obj, memoryview):
+                    self.obj = (<memoryview>self.obj).obj
                 self.readonly = <bint>pybuf.readonly
                 self.ndim = <Py_ssize_t>pybuf.ndim
                 self.itemsize = <Py_ssize_t>pybuf.itemsize

@pentschev what do you think?

@pentschev
Member

Thanks for investigating, @vyasr. I agree with your assessment and I'm fine with your proposal of adjusting the test alone. I think the original implementation was done by @jakirkham, so I'm pinging him here in case he has strong opinions.

@vyasr
Contributor Author

vyasr commented Mar 2, 2026

Great. I've updated the test and reverted the benchmark changes.

@vyasr vyasr requested a review from pentschev March 2, 2026 23:51
Member

@pentschev pentschev left a comment

LGTM, thanks Vyas! I'll leave this open until EOD in case @jakirkham has comments, otherwise we're good to go.

@vyasr
Contributor Author

vyasr commented Mar 3, 2026

/merge

@rapids-bot rapids-bot Bot merged commit a96badb into rapidsai:main Mar 3, 2026
159 of 161 checks passed
@vyasr vyasr deleted the limited_api_testing branch March 3, 2026 23:01
Comment on lines +48 to +49
while isinstance(obj, memoryview):
    obj = obj.obj
Member

memoryviews are actually pretty clever in that they avoid multiple nesting.

They do this by making all memoryviews views that share an underlying buffer.

We can see this with a simple example:

In [1]: b = b"abc"

In [2]: mv1 = memoryview(b)

In [3]: mv2 = memoryview(mv1)

In [4]: mv1.obj is b
Out[4]: True

In [5]: mv2.obj is b
Out[5]: True

So I don't think this while loop is needed, as it will resolve in the first iteration. It isn't harming anything, though (particularly in a test).

Contributor Author

You're right that we don't really need a while loop, we just need one level of extraction (see the explanation at #600 (comment)). The top level of extraction is sufficient.

@jakirkham
Member

It looks like the Conda build is failing on main with this error:

 Processing recipe at path: /__w/ucxx/ucxx/conda/recipes/ucxx/recipe.yaml
Error:   × this recipe uses the deprecated top-level 'cache:' key. The 'cache' format
  │ has been replaced by 'staging' outputs. To automatically migrate your
  │ recipe, run:
  │ 
  │ rattler-build migrate-recipe --recipe /__w/ucxx/ucxx/conda/recipes/ucxx/
  │ recipe.yaml
  │ 
  │ For more information, see: https://rattler-build.prefix.dev/latest/
  │ multiple_output_cache/

@vyasr
Contributor Author

vyasr commented Mar 4, 2026

Yes, see rapidsai/ci-imgs#376

@jakirkham
Member

Great thanks Vyas! 🙏

After restarting, CI passed

So this change should now be available in packages

rapids-bot Bot pushed a commit that referenced this pull request Mar 4, 2026
xref rapidsai/build-planning#42

Changes made:
- picked up the changes in #600 so we are now limited API compatible
- split `libucxx` and `ucxx` `rattler-build` recipes into separate folder
  - needed to do this because we have different matrix filters for the two build-types and we need/want to build `cp311` packages using Python 3.11

Ops-Bot-Merge-Barrier: true

Authors:
  - Gil Forsyth (https://github.com/gforsyth)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #564