forked from linux-rdma/rdma-core
[RFC] libibverbs: Add UET-verbs #1
Closed
Conversation
Add a flag to directly query RDMA device support for QP data-in-order semantics without enforcing host CPU architecture restrictions. This is particularly useful in scenarios where the GPU performs data polling directly, with the application responsible for ensuring GPU-side support for data-in-order semantics. Reviewed-by: Michael Margolin <mrgolin@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Daniel Kranzdorf <dkkranzd@amazon.com>
For the new polling API add an option to dynamically choose the CQ basic polling functions: start, next and end. This will allow for different optimizations with the first one being CQs with a single sub CQ. With this type of CQ we have an overhead of function calls and redundant for loop that we can drop. Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Add an option to create a single threaded CQ, if a single threaded CQ is created the CQ lock isn't taken on completion polling functions. Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Rename the structure and input parameter to align with other libibverbs API calls. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Add vendor error code definitions and corresponding work completion status mappings for retry and rnr timeout exceeded errors. Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
Add missing TX capability check in test_flow_rdma_transport_domain_traffic. The test only validated RX flow table support, but should check both RX and TX flow table capabilities. Fixes: fb81142 ("tests: Add test to cover RDMA transport domain") Signed-off-by: Daniel Hayon <dhayon@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Set the appropriate CUDA flag when creating FD for mlx5 DMABUF. This flag is mandatory for data-direct traffic. Signed-off-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
The relaxed_ordering_read field is deprecated in its old position and was renamed to relaxed_ordering_read_pci_enabled. A new field was added for relaxed_ordering_read. Signed-off-by: Elyashiv Cohen <elyashivc@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add all necessary support for DevX events in pyverbs. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add tests for DevX events with cq error. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Extend Mlx5FlowActionAttr to support counters. Adjust QP action setter in Mlx5FlowActionAttr. Adjust type attribute to support more actions. Add Mlx5ModifyFlowAction action. Extend Mlx5Flow to support counters. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add counter test for FW steering in mlx5_flow tests. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Extend the mlx5dv_create_flow API to support bulk counter operations by introducing a new action type MLX5DV_FLOW_ACTION_COUNTERS_DEVX_WITH_OFFSET. This allows users to specify an offset within DEVX counter objects for more granular bulk counter object management. The implementation removes the previous auxiliary array approach (_mlx5dv_create_flow with actions_attr_aux parameter) in favor of a cleaner design that embeds offset information directly within the flow action structure. The mlx5dv_flow_action_attr union is extended with a bulk_obj member containing both the DEVX object and an offset, also allowing external rdma-core applications to use DEVX bulk counters via the offset. Existing applications using MLX5DV_FLOW_ACTION_COUNTERS_DEVX continue to work unchanged, while new applications can use the enhanced MLX5DV_FLOW_ACTION_COUNTERS_DEVX_WITH_OFFSET for bulk counter scenarios. Note that no kernel changes are needed, since DEVX bulk counter objects with offset are already supported. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
In mlx5_alloc_td(), check if blueflame is supported by examining ctx->bf_reg_size before attempting UAR allocation. When blueflame is not supported (bf_reg_size == 0), fallback to using the shared nc (non-cached) UAR instead of trying to allocate a dedicated UAR. This prevents unnecessary dedicated UAR allocation attempts on devices that don't support blueflame, while ensuring td allocation succeeds by using the available non-cached singleton UAR. In mlx5_dealloc_td(), only detach dedicated UARs by checking the singleton flag to avoid incorrectly freeing the shared nc_uar. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Add CX9 to MLX5_DEVS list. Signed-off-by: Elyashiv Cohen <elyashivc@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add new pyverbs APIs to support DevX async command completion, including functions to create, get, and destroy async command completion objects. Signed-off-by: Linoy Ganti <lganti@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add a new test to verify async command completion support in DevX. This test covers the creation of a DevX QP, issuing an asynchronous query, and validating the completion and results. Signed-off-by: Linoy Ganti <lganti@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
- Create a shared helper method `rdma_transport_domain_test` to handle common test setup. - Update both test methods to use the helper with their specific priority values. Signed-off-by: Daniel Hayon <dhayon@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Since new behavior in the Linux driver now allows unprivileged users to perform additional operations, requires_root was replaced with requires_cap for the relevant cases. Signed-off-by: Shlomo Assaf <sassaf@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add new test test_create_ud_qp_with_privileged_qkey to cover privileged QKEY (0x80000000) functionality for UD QPs. The test verifies that the privileged QKEY can be set and queried correctly for both legacy QPs (created via ibv_create_qp) and extended QPs (created via ibv_create_qp_ex). The test ensures proper handling of IB_QP_PRIVILEGED_Q_KEY by: - Creating UD QPs using both legacy and extended creation methods - Setting the privileged QKEY value during QP initialization - Querying the QKEY attribute to verify it was set correctly - Validating that both creation paths handle privileged QKEYs properly This test requires CAP_NET_RAW capability as privileged QKEYs are restricted to privileged users. Signed-off-by: Shlomo Assaf <sassaf@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add new action to flow API to support flow counters with offset. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Reviewed-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add test_counters_bulk_flow to test flow action counter with offset. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Reviewed-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
1. Add support for extended MR. 2. Add DMA Handle class. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Reviewed-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
Add new tests to cover MREX and DMAHandle which represents new verbs API ibv_reg_mr_ex and ibv_alloc_dmah. Signed-off-by: Maxim Chicherin <maximc@nvidia.com> Reviewed-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com>
To commit: 810f874eda8e ("RDMA/ucma: Support query resolved service records").
Signed-off-by: Sean Hefty <shefty@nvidia.com>
mlx5: Misc. improvements
Fixed an issue where, in an RDMA communication environment spanning three layers of networks (different subnets), a data packet using a global routing header (GRH) to establish an address handle (AH) could not be transmitted across the router, because ah_attr.grh.hop_limit was hard-coded to 1, causing RDMA communication failure across network nodes. Now adjust ah_attr.grh.hop_limit to 64. Co-authored-by: kanyong <kanyong@kylinos.cn> Co-authored-by: ningjin <ningjin@kylinos.cn> Co-authored-by: liangchangwei <liangchangwei@kylinos.cn> Co-authored-by: daiyanlong <daiyanlong@kylinos.cn> Signed-off-by: daiyanlong <daiyanlong@kylinos.cn>
To support RawEth QP: - Added flow verbs create_flow/destroy_flow to the RoCE lib. These flow verbs are used to manage flows on RawEth QPs. - Added CQE processing for completion type BNXT_RE_WC_TYPE_RECV_RAW. Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com> Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
efa: Extend DV query CQ to return doorbell
DirectWQE fields are not assigned or cleared explicitly when DirectWQE is not used. When the QP wraps around, data in these fields from the previous use at the same position still remains and is issued to HW by mistake. Clear these fields before issuing the doorbell to HW. Fixes: 159933c ("libhns: Add support for direct wqe") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Clean up an extra blank line. Fixes: b479323 ("libhns: Fix the sge num problem of atomic op") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
libhns: Minor fix and cleanup
mlx5: Fix byte_count type in umr_sg_list_create
pyverbs: Release Python GIL when calling blocking CMID functions
A user of libibverbs must rely heavily on external documentation, specifically the IBTA vol. 1 specification, to understand how the API is used. However, the API itself has evolved beyond support for only Infiniband. This leaves both users and potential vendors trying to plug into the API struggling, as the names used by the library reflect Infiniband naming, but the concepts have broader use. To provide better guidance on what the current verbs semantic model describes, provide documentation on how major verbs constructs are used. This includes referencing the historical meaning of verbs objects, as well as their evolved use. The proposed descriptions are directly intended to help new transports, such as Ultra Ethernet, understand how to adopt verbs for best results and where potential changes may be needed. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Ultra ethernet is a new connectionless transport that targets HPC and AI applications running at extreme scale. Introduce new node and transport types for devices that only support the new ultra ethernet transport. UET may be layered over UDP/IP using a well-known UDP port (similar to RoCEv2), or may be layered directly over IP. Define new GID types to allow users to select UET plus the underlying protocol layering (similar to how RoCEv1 and RoCEv2 are handled). Signed-off-by: Sean Hefty <shefty@nvidia.com>
UET is designed around connectionless communication. To expose UET through verbs, we introduce a new reliable-unconnected QP type (named to align with existing QP types). Infiniband defines several states that a QP may be in. Many of the states are unsuitable for unconnected QPs in general and may be irrelevant depending on HW implementations. For UET, we define only two states for a UET QP: RTS and error. A UET QP is created in the ready-to-send state. To create a UET QP directly into the RTS state, the full set of QP attributes is needed at creation time. Struct ibv_qp_init_attr_ex is extended to include struct ibv_qp_attr for this purpose. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Job IDs are used to identify a distributed application.
The concept is widely used in HPC and AI applications, to
identify a set of distributed processes as belonging to
a single application.
Job IDs are integral to ultra ethernet. A job ID is
carried in every transport message and is part of a
UET QP address. UEC defines that job IDs must be managed
by a privileged entity. The association of a job ID to
a specific QP is a protected operation.
A simple view of the job security model is shown as this
object model:
device <--- job ID
^ ^
| |
PD <--- job key
^ ^ ^
| \___ | (optional)
QP --- MR
This patch focuses on the job ID. Job keys are discussed
in a following patch.
We define new verb calls to allocate a job object. Each
job object is assigned a unique ID. The assignment of
ID values to job objects is outside the scope of the API,
and would usually be handled through a job launcher or
process manager. The ibv_alloc_job() call is used to
create and configure a job object. It is expected that
the kernel will enforce that callers have the proper
privileges to create job objects on devices. (Similar
to opening QP 0 or 1).
Once a job object has been created, it may be shared with
local processes using a shared fd mechanism. The creating
process obtains a sharable fd using ibv_export_job() and
exchanges the fd with the processes of the job (e.g. via
sockets). On receiving the fd, the processes use
ibv_import_job() to setup local job resources.
A job is associated with addressing information, which
includes protocol stack data, as well as an ID. The number
of bytes of the ID which are valid is dependent on the
associated protocol. For UET, it is 3-bytes.
A job object performs an additional function beyond linking
a QP with a job ID. It defines a mechanism by which local
processes can share addressing information of peers. This
can reduce the amount of memory used to store addresses
locally and enables future optimizations, such as applying
job level encryption. The feature will also map well to
HPC and AI applications that identify peers using a rank.
Conceptually, a virtual address array may be stored with
a job object. Addresses are inserted or removed from the
array at a given index location. The intent is that the
index can map directly to the process' rank. When sending
to a peer, the peer can be identified by the job plus the
index.
Note that the implementation for the job's addressing array
is not defined. A vendor may implement this in a variety
of ways. Addresses may be pre-inserted by the job launcher,
and the transport addresses may be generated using an
algorithm.
Signed-off-by: Sean Hefty <shefty@nvidia.com>
The job object model can be viewed as:
device <--- job ID
^ ^
| |
PD <--- job key
^ ^ ^
| \___ | (optional)
QP --- MR
This patch introduces the job key object.
The relationship between a job key and a job ID is similar to
an lkey to a MR. A job object maps to a job ID value.
Job objects are device level objects. A job key associates
the job ID with a protection domain to provide process
level protections.
Job keys are associated with a 32-bit jkey value. The jkey
will be used when posting a WR to associate a transfer with
a specific job. That is, the jkey is what mirrors the lkey
concept. The NIC converts the jkey to the job ID when
transmitting packets on the wire, applying appropriate checks
that the QP has access to the target job ID. E.g. the job
key and QP belong to the same PD.
UET allows a registered MR to optionally be accessible only
to members of a specific job. The job key will also be used
as an optional attribute when creating a MR. Details on
associating a MR with a job key are defined in a later patch.
Signed-off-by: Sean Hefty <shefty@nvidia.com>
Add new extended QP functions to set necessary input fields related to supporting RU QPs and the UE transport. The UE transport supports 64 bits of immediate data and 64-bit rkeys. Provide expanded APIs to support both. Also include APIs to set full UET destination address data. UET QPs have an additional address component beyond the QP or endpoint address. They have a concept defined as a resource index. A resource index can be viewed as additional receive queues attached to the QP, which are directly addressable by a sender. One intended use of resource indices is to allow a single UET QP to separate traffic from different services. For example, HPC traffic may use one subset of indices, AI traffic a different subset, and storage a third. The number of resource indices supported by a QP is vendor specific, and how they are used by applications is outside the scope of the verbs API. The resource index concept reuses the verbs work queue concept. A new send WR flag is also added: delivery complete. When requested and supported by the provider, this flag indicates that a completion for the send operation means the data is globally observable at the target. This is an optional feature of the UE transport. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Allow UET specific information to be reported as part of work completions. This includes the larger immediate data size, the job ID carried in the transport header, and a peer ID, also carried in the transport header. Included with completion data is a UET transport field, called the initiator in UEC terminology. This is a user configurable value intended to map to the rank number for a parallel application. The initiator field only has meaning within a specific job ID. As a result, when the value is valid in a completion, so is the job ID. (For UET, the initiator value is part of the UET address.) The verbs naming of this field is the slightly more generic term, src_id, to align with src_qpn (in ibv_wc). Signed-off-by: Sean Hefty <shefty@nvidia.com>
The UET protocol and devices support advanced features for memory regions. From the viewpoint of the protocol, an rkey is 64-bits, with specific meaning applied to several of the bits. Struct ibv_mr is extended to report a 64-bit rkey. Providers are expected to set the 32-bit rkey and/or rkey64 field in struct ibv_mr correctly based on the transports supported by the device. A second protocol feature is that a MR may be restricted to being accessible by a specific job. Since a UET QP may be used to communicate with multiple jobs simultaneously, the memory registration call is expanded to allow associating a job key with a MR. Signed-off-by: Sean Hefty <shefty@nvidia.com>
UET defines multiple packet delivery modes: ROD (reliable, ordered delivery), RUD (reliable, unordered delivery), RUDI (reliable, unordered delivery for idempotent transfers), and UUD (unreliable, unordered delivery). The packet delivery modes impact how out-of-order packets are handled at the receiver, retry mechanisms, multi-pathing support, and congestion control algorithms, among other behavior. A single UET QP may use multiple packet delivery modes simultaneously based on the application data transfer being performed. Even traditional RDMA protocols are evolving to allow greater flexibility in how message and data ordering are delivered at the receiver. This patch introduces a new QP attribute structure called QP semantics. This structure defines the message and data ordering requirements that a QP must implement. If a QP cannot meet the requested semantics, QP creation should fail, but a vendor can always provide stronger guarantees than those requested by the user. QP semantics indicate if the QP must provide message and data ordering guarantees, such as write-after-write, read-after-write, send-after-write, etc. Traditionally, these ordering guarantees were defined by the relevant RDMA specifications, and users of the libibverbs API needed to know to reference those specs in order to use a QP correctly (such as when to fence data transfers). As an alternative, a new device level query call is added, which can return the supported ordering guarantees for a given QP type over a specific transport. The QP semantics may optionally be passed into the create QP operation. After querying for supported semantics, applications can remove unneeded ordering guarantees in order to leverage available network features (such as multipath support). This allows vendors to adjust transport behavior accordingly. For example, UET can leverage ROD when sending messages, but use RUD or RUDI for RDMA transfers.
Data ordering between messages is further defined to indicate the maximum transfer size for which ordering holds. For example, RDMA write-after-read ordering may be restricted to single MTU transfers. Finally, as a 'fix' to MTU sizes being forced to a power of 2, a max_pdu is introduced. The max PDU reports the maximum size of *user* data that can be carried in a single transport packet. The max PDU is relative to the port MTU, minus protocol headers. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Legacy RDMA transports are restricted to 32 bits of immediate data, while UET supports 64 bits. Additionally, UET does not require that RDMA writes with immediate consume a posted receive buffer at the target. The spec even goes so far as to mandate that RDMA traffic be treated separately at the target from send operations; however, such a mandate is not visible in the transport and places restrictions on the NIC implementation. NICs that support multiple protocols, including UET, may be optimized for legacy RDMA support. For example, CQ entries may only be able to store 32 bits of immediate data. To handle different implementations and transports, we extend the QP semantic structure to report the immediate data size, as well as implementation constraints, such as the need to consume a posted receive buffer. This change has an added advantage: it is now possible for a user to indicate that immediate data will not be used by setting the size to 0 when creating the QP. For devices which support a smaller immediate data size than that carried by the transport, truncated immediate data is extended with 0s when writing to the wire, and completions report the lowest valid bits. The QP semantics are extended with a new use_flags field. These flags allow providers to inform applications of constraints on using the HW, allowing greater flexibility in implementations. When set, IBV_QP_USAGE_IMM_DATA_RQ indicates that RDMA writes with immediate data will consume a posted receive buffer on the QP. This is standard behavior for legacy RDMA transports, but not for UET. By setting this flag, a provider can indicate this as their default requirement even when using UET QPs. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Legacy RDMA devices immediately expose a new MR as soon as the memory registration process completes. That is, even before reg_mr() returns to the caller, the region is accessible to any QP sharing the same PD. UET allows for greater control over access to a MR. Even once a MR has been created, exposure of the MR is treated as a separate operation. This further allows access to a MR to be revoked without it being destroyed, which enables a MR to be used once. E.g. the MR may be the target of a single RDMA operation, with access controlled by the owner of the MR. This behavior differs from the remote invalidate operation. To support this additional level of control, we introduce new QP operations: attach MR and detach MR. A provider indicates that MRs must be explicitly attached to a QP through a new QP usage flag, as this behavior may be specific to a given transport protocol + QP type. E.g. UET + RU QPs may support MR attachment, but UET + UD QPs may not (since the feature is not required). Support and the need to attach a MR to a QP is indicated by the IBV_QP_USAGE_ATTACH_MR usage flag. Signed-off-by: Sean Hefty <shefty@nvidia.com>
UET allows for user selected rkey values to improve scalability. Expose support via a device capability flag and update memory registration accordingly. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Introduce a concept called derived memory regions. Derived MRs are similar to legacy RDMA memory windows, but are set up through the memory registration API, rather than post send. Derived MRs are new MRs that are wholly contained within an existing MR (to share page mappings, for example), but have different access rights or other attributes. For UET, a derived MR allows a MR to be associated with different jobs, with the access rights for each job differing, while still being able to share the underlying HW page mappings. Applications must assume that a derived MR holds a reference on the original MR. The original MR may not be destroyed until all derived MRs have been closed. When a MR is created, a derive_cnt field may be provided to indicate the number of expected derived MRs that an application intends to create. This field is considered an optimization and may be ignored by the provider. Providers that do not support derived MRs may simply create a new MR without sharing resources with the original MR. A derived MR is subject to reported provider restrictions, such as IBV_QP_USAGE_ATTACH_MR. Signed-off-by: Sean Hefty <shefty@nvidia.com>
The UET initiator is equivalent to an MPI rank or CCL communicator ID. It is a user settable value used for tag matching purposes. UET carries the initiator field directly in the transport header. Extend the initiator QP attributes to allow user to set the value. We use the more generic term, src_id, instead of the UET specific term. The naming is aligned with src_qpn in ibv_wc. Signed-off-by: Sean Hefty <shefty@nvidia.com>
UET associates multiple receive queues with a single queue pair. In UET terms, a QP maps to a PIDonFEP, and the receive queues are known as resource indices. Resource indices allow for receive side resources to be separated, such that they may be dedicated to separate services (e.g. MPI, CCL, storage). To support separate resources, we reuse the verbs work queue objects (ibv_wq). The API is extended slightly for UET. First, we add an extended device attribute, max_rqw_per_qp, to limit the number of WQs which may be associated with a QP. Secondly, we extend the WQ attributes to allow the user to select the wq_num (i.e. UET resource index) associated with a WQ. It is the responsibility of higher-level SW to allocate, configure, and associate WQs with QPs, so that the QP is assigned the correct number of WQs with the necessary addresses. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Include descriptions of the new objects introduced for UET: job, jkey, and address table, along with definitions of the verbs semantic constructs. Signed-off-by: Sean Hefty <shefty@nvidia.com>
shefty pushed a commit that referenced this pull request on Nov 10, 2025:
Subject: [PATCH] librdmacm: Fix rdma_resolve_addrinfo() deadlock in sync mode
Fix the issue that rdma_resolve_addrinfo() deadlocks when run in sync mode:
(gdb) bt
#0  futex_wait
#1  __GI___lll_lock_wait
#2  0x00007ffff7dae791 in lll_mutex_lock_optimized
#3  ___pthread_mutex_lock
#4  0x00007ffff7f9f018 in ucma_process_addrinfo_resolved
#5  0x00007ffff7fa1447 in rdma_get_cm_event
#6  0x00007ffff7fa1fef in ucma_complete
#7  0x00007ffff7fa2f9c in resolve_ai_sa
#8  0x00007ffff7fa36ab in __rdma_resolve_addrinfo
#9  rdma_resolve_addrinfo
#10 0x00000000004017b6 in start_cm_client_sync
#11 0x00000000004018ee in main
Issue: 4582946
Fixes: 7b1a686 ("librdmacm: Provide interfaces to resolve IB services")
Change-Id: Ia724795a559bab6d965a35b8fd3e0f0096472a44
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
For early review of header changes only. See individual commits for details.
The reg_mr_attr patch needs to integrate with reg_mr_ex, but because reg_mr_ex actually named its parameter as input, additional work is needed.
Note that one of the patches references struct uet_addr. That's defined in a UEC spec, and I haven't copied the details into patch form yet. But just assume it contains the addressing needed for a UET QP.