Resource underutilization, thread thrashing: CPU affinity ignores allowed CPUs and cannot be switched off

### System Info

CPU affinity implementation (introduced in commit 59922f9bc16afee9efcc7ee1c5f9d753ef314ffa, first released in v2.3.0, until current HEAD (4b8cda684b45b799de01a65e3fe3422a34a621d3) ignores already existing CPU pinning for the process.



### Information

- [x] Docker
- [x] The CLI directly

### Tasks

- [x] An officially supported command
- [ ] My own modifications

### Reproduction

Try to run inference instance on CPU so that you set CPU and memory affinity to some custom value (for instance 8 CPUs from socket 1). CPU and memory affinity can be externally managed using taskset, numactl, cgroups cpuset controller, docker, Kubernetes CPU manager, NRI resource policies, for instance. 

Despite the affinity you choose, this implementation tries to use all CPUs in one NUMA node --- even if that is not allowed by the OS, and even if you wanted to run the process on CPUs of another NUMA node.

This issue prevents running several inference instances on the same system because they cannot be assigned to separate sockets, (sub)NUMA nodes, or any other disjoint CPU sets. However, running several instances on multi-socket systems significantly increase total token throughput per system, and on the other hand, several instances even on single-socket system allows serving more customers with reasonable latency, given that instances run on disjoint CPU sets.

This issue also prevents platform-specific optimization of single inference instance, as implicit assumptions are not optimal on every platform. For instance, the implementation prevents using high-bandwidth memory (HBM) directly built into some CPU chips (Xeon MAX). This happens because HBM is visible as a CPU-less NUMA node, and the current implementation sets memory affinity only on the NUMA node that include CPUs from the first socket. Also, if sub-NUMA clustering (SNC) is on, this implementation uses only a fraction of CPUs available in a socket.

### Expected behavior

If CPU and memory affinity has been already set, this inference instance must detect and respect those restrictions, and adjust its own behavior accordingly. It cannot use other than allowed CPUs or memories anyway.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Resource underutilization, thread thrashing: CPU affinity ignores allowed CPUs and cannot be switched off #3011

System Info

Information

Tasks

Reproduction

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Resource underutilization, thread thrashing: CPU affinity ignores allowed CPUs and cannot be switched off #3011

Description

System Info

Information

Tasks

Reproduction

Expected behavior

Activity

eero-t commented on Feb 11, 2025

askervin commented on Feb 11, 2025

eero-t commented on Feb 11, 2025

askervin commented on Feb 11, 2025

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions