Use limited API-compatible APIs in arr by vyasr · Pull Request #600 · rapidsai/ucxx

vyasr · 2026-02-28T00:29:27Z

This PR isolates the changes to only use symbols that are part of the limited API from #564. The second commit includes the test changes that I used to benchmark, which should be reverted.

I ran the send_recv benchmark (ucxx-async backend, TAG transfer API, NumPy objects, 1 B payload, 100,000 iterations, TCP transport) across 8 configurations:

two progress modes: blocking, polling
two code versions: original (without bddf9d6) and limited API (with bddf9d6)
two array dimensionalities: linear (without 5904f16) and 3D* (with 5904f16)

I was looking to see if the limited API changes meaningfully affected performance, particularly in the 3D case since the changes were around the array's allocation of shapes, which are more relevant for higher dimensional arrays. What I found was that there are no meaningful performance differences across any of the 8 configurations. Bandwidth and latency numbers are consistent within normal run-to-run variance across all combinations of progress mode, code version, and array dimensionality. In particular, the limited API changes show no measurable overhead on either linear or multi-dimensional arrays. All the differences seem to be within the margin of noise.

| Code version | Progress mode | Array dim | Bandwidth (avg) | Bandwidth (med) | Latency (avg) | Latency (med) |
|---|---|---|---|---|---|---|
| original | blocking | linear | 28.19 kiB/s | 28.52 kiB/s | 34,643 ns | 34,238 ns |
| limited API | blocking | linear | 30.72 kiB/s | 32.03 kiB/s | 31,793 ns | 30,490 ns |
| original | blocking | 3D | 27.39 kiB/s | 27.23 kiB/s | 35,652 ns | 35,857 ns |
| limited API | blocking | 3D | 26.79 kiB/s | 27.42 kiB/s | 36,459 ns | 35,610 ns |
| original | polling | linear | 33.27 kiB/s | 34.91 kiB/s | 29,349 ns | 27,974 ns |
| limited API | polling | linear | 32.03 kiB/s | 33.16 kiB/s | 30,492 ns | 29,452 ns |
| original | polling | 3D | 28.53 kiB/s | 28.66 kiB/s | 34,234 ns | 34,078 ns |
| limited API | polling | 3D | 28.34 kiB/s | 28.90 kiB/s | 34,461 ns | 33,793 ns |
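
To put numbers on "within the margin of noise", the relative change in average latency between the two code versions can be computed directly from the summary table. This is a minimal sketch; the dictionary layout and the `relative_change` helper are illustrative names, not part of the benchmark.

```python
# Average latency in ns, copied from the summary table.
# Keys are (progress mode, array dim); values are (original, limited API).
avg_latency_ns = {
    ("blocking", "linear"): (34_643, 31_793),
    ("blocking", "3d"): (35_652, 36_459),
    ("polling", "linear"): (29_349, 30_492),
    ("polling", "3d"): (34_234, 34_461),
}


def relative_change(original, limited):
    """Fractional change of the limited API run relative to the original."""
    return (limited - original) / original


changes = {
    config: relative_change(*pair) for config, pair in avg_latency_ns.items()
}

for (mode, dim), change in changes.items():
    print(f"{mode:8s} {dim:6s} {change:+.1%}")
```

The sign is not consistent across configurations (the limited API run is faster in one case and slower in the other three), which is more suggestive of run-to-run variance than of a systematic overhead.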

Raw Results

blocking · original · linear array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 28.19 kiB/s
Bandwidth (median)        | 28.52 kiB/s
Latency (average)         | 34643 ns
Latency (median)          | 34238 ns
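
As a sanity check on the output above, the reported bandwidth appears consistent with dividing the payload size by the reported latency. The sketch below assumes that relationship (it is inferred from the numbers, not taken from the benchmark source); `bandwidth_kib_per_s` is an illustrative helper.

```python
def bandwidth_kib_per_s(n_bytes, latency_ns):
    """Bandwidth in kiB/s implied by moving n_bytes in latency_ns nanoseconds."""
    bytes_per_second = n_bytes / (latency_ns * 1e-9)
    return bytes_per_second / 1024


# Values from the blocking / original / linear array run above.
average = bandwidth_kib_per_s(1, 34_643)
median = bandwidth_kib_per_s(1, 34_238)

print(f"{average:.2f} kiB/s")  # compare with the reported 28.19 kiB/s
print(f"{median:.2f} kiB/s")   # compare with the reported 28.52 kiB/s
```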

blocking · original · nd array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 27.39 kiB/s
Bandwidth (median)        | 27.23 kiB/s
Latency (average)         | 35652 ns
Latency (median)          | 35857 ns

blocking · limited API · linear array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 30.72 kiB/s
Bandwidth (median)        | 32.03 kiB/s
Latency (average)         | 31793 ns
Latency (median)          | 30490 ns

blocking · limited API · nd array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | blocking
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 26.79 kiB/s
Bandwidth (median)        | 27.42 kiB/s
Latency (average)         | 36459 ns
Latency (median)          | 35610 ns

polling · original · linear array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 33.27 kiB/s
Bandwidth (median)        | 34.91 kiB/s
Latency (average)         | 29349 ns
Latency (median)          | 27974 ns

polling · original · nd array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 28.53 kiB/s
Bandwidth (median)        | 28.66 kiB/s
Latency (average)         | 34234 ns
Latency (median)          | 34078 ns

polling · limited API · linear array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 32.03 kiB/s
Bandwidth (median)        | 33.16 kiB/s
Latency (average)         | 30492 ns
Latency (median)          | 29452 ns

polling · limited API · nd array

Roundtrip benchmark
================================================================================
Iterations                | 100000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-async
Transfer API              | TAG
Progress mode             | polling
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 28.34 kiB/s
Bandwidth (median)        | 28.90 kiB/s
Latency (average)         | 34461 ns
Latency (median)          | 33793 ns

pentschev

Thanks @vyasr for working on this. I'm very much supportive of those changes if we can have the tests fixed and clarify why the changes to the benchmark were made; at first glance all of those changes seem undesirable to me and possibly indicative of a regression in Array support. If we can have all tests passing, hopefully without requiring the changes to benchmarks, and can still maintain the performance from the description, I'm +1 on getting these changes in.

pentschev · 2026-03-01T17:30:24Z

+            length = ((self.args.n_bytes + 3) // 4) * 4
+            shape = (length // 4, 2, 2)
            if not self.args.enable_am:
                if self.args.reuse_alloc and self.args.n_buffers == 1:
-                    reuse_msg = Array(xp.zeros(self.args.n_bytes, dtype="u1"))
+                    reuse_msg = Array(xp.reshape(xp.zeros(length, dtype="u1"), shape))


Why do we need this change, with all the cryptic constants and the final reshape? Note we don't do that in other implementations like https://github.com/rapidsai/ucxx/blob/main/python/ucxx/ucxx/benchmarks/backends/ucxx_core.py, and I would very much like to avoid comparing apples to oranges in the results.
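
For readers following the diff above: the two added lines round `n_bytes` up to the next multiple of 4 and then describe the same buffer as a `(length // 4, 2, 2)` array, so the element count is unchanged by the reshape. A minimal pure-Python sketch of that arithmetic follows; `padded_3d_shape` is a hypothetical helper name, not something in the benchmark.

```python
import math


def padded_3d_shape(n_bytes):
    """Mirror the diff: round n_bytes up to a multiple of 4, then split
    the padded length across a (length // 4, 2, 2) shape."""
    length = ((n_bytes + 3) // 4) * 4
    shape = (length // 4, 2, 2)
    return length, shape


for n_bytes in (1, 4, 10):
    length, shape = padded_3d_shape(n_bytes)
    # The trailing 2 x 2 dimensions account for the factor of 4.
    assert math.prod(shape) == length
    print(n_bytes, length, shape)
```

For the 1 B payload used in the PR description this gives a 4-byte buffer with shape `(1, 2, 2)`.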

pentschev · 2026-03-01T17:36:20Z

Sorry, I should have read the description more attentively; I see the reason for the shape changes now. If we can get the tests fixed, I'm definitely fine with the performance results, which are essentially unchanged. I would like us either not to commit the 3D changes to the benchmark or to make them configurable, but if you'd like to make that configurable we should do it for the other backends as well, which I don't think is worth the trouble.

vyasr · 2026-03-02T21:30:12Z

The failing test is IMHO a bit too strong. It's checking whether the obj under the Array is the same as the obj under the memoryview. The failure is occurring because PyObject_GetBuffer has slightly different behavior than PyMemoryView_FromObject. If passed a memoryview, PyMemoryView_FromObject will pull the underlying object out of the memoryview and put it into the new memoryview, whereas PyObject_GetBuffer will always use the passed-in object as the obj member. That means that by using PyObject_GetBuffer the test would instead pass if we did:

arr = Array(buffer)
mv = memoryview(buffer)
test_obj = arr.obj
# General solution: unwrap any chain of memoryviews
while isinstance(test_obj, memoryview):
    test_obj = test_obj.obj
# Or, more specifically, unwrap exactly one level
test_obj = arr.obj.obj
# Now this will pass
assert test_obj is mv.obj

IMO changing the test is the right answer, since we shouldn't promise that an object chain never forms; i.e. mv = memoryview(memoryview(memoryview(arr))) shouldn't be expected to have mv.obj == arr.obj, and it should be OK to need mv.obj.obj.obj to get back the original object.
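
The "general solution" loop from the snippet above can be written as a small helper. This sketch uses a plain `bytearray` and `memoryview` in place of `Array`, since it only illustrates the unwrapping; `unwrap_memoryview` is a hypothetical name.

```python
def unwrap_memoryview(obj):
    """Follow .obj links until reaching something that is not a memoryview."""
    while isinstance(obj, memoryview):
        obj = obj.obj
    return obj


buffer = bytearray(b"abc")

# A memoryview is unwrapped to the object that exports the buffer.
assert unwrap_memoryview(memoryview(buffer)) is buffer
# Anything that is not a memoryview is returned unchanged.
assert unwrap_memoryview(buffer) is buffer
```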

However, if we feel strongly otherwise, it's straightforward enough to fix this in the code with the following diff:

❯ git diff
diff --git a/python/ucxx/ucxx/_lib/arr.pyx b/python/ucxx/ucxx/_lib/arr.pyx
index 6bf0d00..e3411a5 100644
--- a/python/ucxx/ucxx/_lib/arr.pyx
+++ b/python/ucxx/ucxx/_lib/arr.pyx
@@ -142,6 +142,8 @@ cdef class Array:

                 self.ptr = <uintptr_t>pybuf.buf
                 self.obj = pybuf.obj
+                while isinstance(self.obj, memoryview):
+                    self.obj = (<memoryview>self.obj).obj
                 self.readonly = <bint>pybuf.readonly
                 self.ndim = <Py_ssize_t>pybuf.ndim
                 self.itemsize = <Py_ssize_t>pybuf.itemsize

@pentschev what do you think?

pentschev · 2026-03-02T21:55:58Z

Thanks for investigating, @vyasr. I'm in agreement with your assessment and I'm fine with your proposal of adjusting the test alone. I think the original implementation was done by @jakirkham, so I'm pinging him here in case he has strong opinions.

vyasr · 2026-03-02T23:51:31Z

Great. I've updated the test and reverted the benchmark changes.

pentschev

LGTM, thanks Vyas! I'll leave this open until EOD in case @jakirkham has comments, otherwise we're good to go.

vyasr · 2026-03-03T23:01:04Z

/merge

jakirkham · 2026-03-04T00:20:36Z

+    while isinstance(obj, memoryview):
+        obj = obj.obj


memoryviews are actually pretty clever in that they avoid multiple nesting

They do this by making all memoryviews views that share an underlying buffer

We can see this with a simple example

In [1]: b = b"abc"
In [2]: mv1 = memoryview(b)
In [3]: mv2 = memoryview(mv1)
In [4]: mv1.obj is b
Out[4]: True
In [5]: mv2.obj is b
Out[5]: True

So I don't think this while loop is needed, as it will resolve in the first iteration. It isn't harming anything, though (particularly in a test).

vyasr

You're right that we don't really need a while loop; we just need one level of extraction (see the explanation at #600 (comment)). The top level of extraction is sufficient.
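
A quick way to check this with plain Python objects: even when memoryviews are stacked three deep, a single `.obj` access on the outermost view returns the original exporter. This is a sketch with a `bytearray` standing in for the real buffer; it does not exercise `Array` or `PyObject_GetBuffer`.

```python
buffer = bytearray(b"abc")

# Stack three memoryviews on top of the same buffer.
nested = memoryview(memoryview(memoryview(buffer)))

# One level of extraction is enough to reach the original object.
assert nested.obj is buffer

# The nested view still refers to the same memory as the buffer.
nested[0] = ord("x")
assert buffer == bytearray(b"xbc")
```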

jakirkham · 2026-03-04T00:24:04Z

It looks like the Conda build is failing on main with this error:

 Processing recipe at path: /__w/ucxx/ucxx/conda/recipes/ucxx/recipe.yaml
Error:   × this recipe uses the deprecated top-level 'cache:' key. The 'cache' format
  │ has been replaced by 'staging' outputs. To automatically migrate your
  │ recipe, run:
  │ 
  │ rattler-build migrate-recipe --recipe /__w/ucxx/ucxx/conda/recipes/ucxx/
  │ recipe.yaml
  │ 
  │ For more information, see: https://rattler-build.prefix.dev/latest/
  │ multiple_output_cache/

vyasr · 2026-03-04T00:27:21Z

Yes, see rapidsai/ci-imgs#376

jakirkham · 2026-03-04T00:45:45Z

Great thanks Vyas! 🙏

After restarting, CI passed

So this change should now be available in packages

xref rapidsai/build-planning#42

Changes made:
- picked up the changes in #600 so we are now limited API compatible
- split `libucxx` and `ucxx` `rattler-build` recipes into separate folder
- needed to do this because we have different matrix filters for the two build-types and we need/want to build `cp311` packages using Python 3.11

Ops-Bot-Merge-Barrier: true

Authors:
- Gil Forsyth (https://github.com/gforsyth)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #564

vyasr added 2 commits February 27, 2026 16:01

Use limited API in arr

bddf9d6

Temporarily modify benchmarking for testing

5904f16

vyasr requested a review from a team as a code owner February 28, 2026 00:29

vyasr self-assigned this Feb 28, 2026

vyasr added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Feb 28, 2026

pentschev reviewed Mar 1, 2026

vyasr added 2 commits March 2, 2026 15:50

Update test

a7ce8c6

Revert "Temporarily modify benchmarking for testing"

c6b9d15

This reverts commit 5904f16.

vyasr requested a review from pentschev March 2, 2026 23:51

pentschev approved these changes Mar 3, 2026

rapids-bot Bot merged commit a96badb into rapidsai:main Mar 3, 2026
159 of 161 checks passed

vyasr deleted the limited_api_testing branch March 3, 2026 23:01

jakirkham reviewed Mar 4, 2026

gforsyth mentioned this pull request Mar 4, 2026

feat: build wheels and conda packages using the limited API #564

Merged

vyasr mentioned this pull request Apr 16, 2026

Add support for Python free-threading builds rapidsai/build-planning#174

Open
