
Support python -Xcpu_count=<n> feature for container environment. #109595

Closed
corona10 opened this issue Sep 20, 2023 · 24 comments
Assignees
Labels
3.13 (bugs and security fixes), type-feature (a feature request or enhancement)

Comments

@corona10
Member

corona10 commented Sep 20, 2023

Feature or enhancement

As described in #80235, there are requests for isolating the CPU count in k8s or container environments, and this is a very important feature these days. (In practice, at my company a lot of workloads run under container environments, and controlling the CPU count is very important to resolve noisy neighbor issues.)

There were a lot of discussions, and following the cgroup spec brings a lot of complexity and performance issues (due to fallback).
JDK 21 chose not to depend on CPU shares to compute the active processor count; instead, it uses -XX:ActiveProcessorCount=<n>.
see: https://bugs.openjdk.org/browse/JDK-8281571

I think that this strategy is worth adopting on the CPython side too.
So if the user executes Python with the -Xcpu_count=3 option, os.cpu_count will return 3 instead of the actual CPU count detected from the system.
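
For illustration, the proposed behavior would look like this (a hypothetical invocation on a machine with more than 3 CPUs, assuming the option is implemented as described):

$ python -X cpu_count=3 -c 'import os; print(os.cpu_count())'
3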

cc @vstinner @indygreg

Linked PRs

@corona10 corona10 self-assigned this Sep 20, 2023
@corona10 corona10 added the type-feature and 3.13 labels Sep 20, 2023
@vstinner
Member

I would prefer to first fix os.cpu_count() to take sched_getaffinity() into account. But we should provide the two values: one taking affinity into account, and the "total" number of CPUs. It might be tricky to change the default :-(
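
For context, both values can already be read today; a minimal sketch (os.sched_getaffinity() is Unix-only):

import os

# Total number of logical CPUs known to the OS.
print(os.cpu_count())

# Number of CPUs the current process may run on, reflecting
# restrictions such as taskset or cpusets (Unix only).
print(len(os.sched_getaffinity(0)))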

@corona10
Member Author

It might be tricky to change the default :-(

Providing -Xcpu_count will not touch any default behavior unless the user explicitly sets the value, so any side effect will be one the user expects.

I would prefer to first fix os.cpu_count() to take sched_getaffinity() into account

We need to understand the actual user's use case. AFAIK k8s users or container users never use something like taskset to isolate CPU resources. They just write a Dockerfile and set the resources on k8s pods.
Even if we update os.cpu_count to reflect sched_getaffinity, it will not be directly helpful to actual container people.

I think that's why the JDK still provides -XX:ActiveProcessorCount=<n>.
As long as we don't provide a practical way to limit CPU resources, there is no way for os.cpu_count to follow those limits.
And as you know, container applications are a veeeery important use case in modern deployments.

@vstinner
Member

os.cpu_count() is used in use cases other than containers, where CPU affinity is also used.

If we modify cpu_count() to take affinity into account and/or if your -X option is implemented, we should add an option to get the "total number of CPUs".

@indygreg
Contributor

AFAIK k8s users or container users never use something like taskset to isolate CPU resources.

I have used taskset in Kubernetes. It is useful to pin processes to specific CPU cores without using something more advanced/modern like the CPU Manager feature.

One of the things I used it for was testing the behavior of a JVM when passing -XX:ActiveProcessorCount= a value different from the number of CPUs it was allowed to schedule on.

There's also the scenario where you have a container running multiple processes and don't want a process to use all available CPUs. In this scenario, in the absence of [something like] taskset, you need something like -XX:ActiveProcessorCount= to forcefully override the default heuristics. (This use case is important for people who realize how inefficient the artificial "containers run only 1 process" rule is and have chosen to eschew it.)

@vstinner
Member

vstinner commented Sep 21, 2023

I created issue #109649: os.cpu_count(): add "affinity" parameter to get the number of "usable" CPUs.

UPDATE: I renamed the usable parameter to affinity.

@corona10
Member Author

There's also the scenario where you have a container running multiple processes and don't want a process to use all available CPUs. In this scenario, in the absence of [something like] taskset, you need something like -XX:ActiveProcessorCount= to forcefully override the default heuristics. (This use case is important for people who realize how inefficient the artificial "containers run only 1 process" rule is and have chosen to eschew it.)

Yeah, this is exactly what I want to say :) Thank you for emphasizing!

@corona10
Member Author

corona10 commented Sep 22, 2023

The reasons I don't want to support cgroups directly are the following.

First, computing the processor count from CPU shares behaves inconsistently (example from the JDK discussion linked below):

docker run ... --cpu-shares=512 java .... ==> os::active_processor_count() = 1
docker run ... --cpu-shares=1024 java ... ==> os::active_processor_count() = 32 (total CPUs on this system)
docker run ... --cpu-shares=2048 java ... ==> os::active_processor_count() = 2
  • It will add overhead to os.cpu_count itself: if we decide to parse cgroups, we have to parse them by default, and AFAIK there is no way to avoid that path if we want cgroup support to be the seamless behavior people really want.
  • Supporting cgroup v2 is much harder than cgroup v1; it will increase the maintenance cost of the CPython codebase, and we would need to follow any new specification if a cgroup v3 is ever released (a taste of the parsing involved is sketched after this list).
  • Let's not repeat what the JDK already tried and failed at: https://bugs.openjdk.org/browse/JDK-8281571
  • Adding a new flag to os.cpu_count() will not propagate to all the libraries the program depends on, but this solution will.
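
To make the parsing cost concrete, here is a minimal sketch of what reading a cgroup v2 CPU limit would entail; the path and file format are cgroup-v2-specific assumptions, and a real implementation would also need cgroup v1 fallbacks:

def cgroup_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    # cgroup v2 "cpu.max" contains "<quota> <period>" (e.g. "200000 100000"),
    # or "max <period>" when no limit is configured.
    try:
        with open(path) as f:
            quota, period = f.read().split()
    except (OSError, ValueError):
        return None  # no cgroup v2 CPU controller, or file unreadable
    if quota == "max":
        return None  # unlimited
    # Round up: a 1.5-CPU quota should count as 2 usable CPUs, minimum 1.
    return max(1, -(-int(quota) // int(period)))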

@indygreg
Contributor

Is there an appetite for allowing non-integer / special values to -Xcpu_count= that define the mechanism to derive the CPU count? e.g.

  • -Xcpu_count=system-processor-count would use the equivalent of os.cpu_count() today.
  • -Xcpu_count=cgroups-shares-1024 would read from cgroups cpu.shares and divide by 1024.
  • -Xcpu_count=sched-affinity would compute CPU count from scheduler affinity.

Obviously we would need to bikeshed the names and behavior a bit. But I see this as a potentially elegant solution that not only allows forceful count overrides but also allows dynamic derivation using well-defined semantics. It also gives CPython more flexibility to introduce new variants/behaviors without breaking backwards compatibility. It seemingly placates all parties.
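
A hypothetical dispatcher for these special values could look like the following sketch; the value names come from the list above, but the cgroup v1 path and the overall shape are illustrative assumptions, not an agreed-upon API:

import os

def resolve_cpu_count(value):
    if value == "system-processor-count":
        return os.cpu_count()
    if value == "sched-affinity":
        return len(os.sched_getaffinity(0))  # Unix only
    if value.startswith("cgroups-shares-"):
        divisor = int(value.removeprefix("cgroups-shares-"))
        with open("/sys/fs/cgroup/cpu/cpu.shares") as f:  # cgroup v1 layout
            return max(1, int(f.read()) // divisor)
    return int(value)  # plain integer: forceful override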

@vstinner
Member

-Xcpu_count=sched-affinity would compute CPU count from scheduler affinity.

As the author of PR #109649, I'm interested in this mode 😁

@vstinner
Member

It seemingly placates all parties.

It seems like there is no good default behavior fitting all use cases. Giving more choices should please more people.

@vstinner
Member

Since you added the PYTHONCPUCOUNT env var, the effect of the env var is wider than a single process: child processes are affected as well.

Can you please add a -X cpu_count value that ignores PYTHONCPUCOUNT: in short, get the Python 3.12 behavior (total number of CPUs) when it is really what I want? Maybe it can just be: -X cpu_count=default.

See my example for a concrete use case: #109652 (comment)

In the UTF-8 Mode, I did something similar: PYTHONUTF8=1 python -X utf8=0 ignores the PYTHONUTF8 env var and ensures that the UTF-8 Mode is not enabled for this command.
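
That precedence is easy to verify through sys.flags:

$ PYTHONUTF8=1 python -X utf8=0 -c 'import sys; print(sys.flags.utf8_mode)'
0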

@vstinner
Member

-Xcpu_count=cgroups-shares-1024 would read from cgroups cpu.shares and divide by 1024.

It may solve my main worry about parsing cgroups: it avoids hardcoding the 1024 constant (used by AWS, among others), which is not a standard but an arbitrary value (no?).

By the way, how do we round the number of CPUs, knowing that cpu_count() returns an int? I suppose that it should be rounded up (towards infinity): for example, 500/1024 should return 1, but should 1200/1024 return 1 or 2? Maybe round to the nearest but always return a minimum of 1? Rounding to the nearest is what round() does (ROUND_HALF_EVEN), combined with a floor of 1:

cpu_count = max(round(shares / ratio), 1)
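
A quick worked check of that formula with illustrative share values:

ratio = 1024
for shares in (500, 1200, 1800):
    print(shares, max(round(shares / ratio), 1))
# 500  -> round(0.488...) = 0, clamped up to 1
# 1200 -> round(1.171...) = 1
# 1800 -> round(1.757...) = 2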

Well, Donghee's initial feature request (provide an integer; ignore affinity, cgroups and anything else) also makes sense. Sometimes the sysadmin wants full control over what's going on.

I don't think that having different choices makes things worse. It gives users the choice to select what best fits their needs.

@corona10
Member Author

corona10 commented Sep 24, 2023

@vstinner

Can you please add a -X cpu_count value that ignores PYTHONCPUCOUNT: in short, get the Python 3.12 behavior (total number of CPUs) when it is really what I want? Maybe it can just be: -X cpu_count=default.

That's a nice suggestion, see: 89d8bb2
We can add an "affinity" option for a similar purpose, but it's up to you!

@vstinner
Member

Use cases

This issue is complicated since there are multiple use cases. Let me try to list some of them.

  • (A) multiprocessing-like use case: I want to use my machine "at 100%". It can be about creating threads, processes, or both. For example, if a machine has 12 logical CPUs and 6 CPU cores, you want to spawn at least 12 worker processes in multiprocessing.

  • (B) System information: report hardware specifications, how many physical CPUs a machine has, and the number of logical CPUs or "threads".

  • (C) Deploy an application of type (A) and adapt it (as a sysadmin) to better use machine resources. In this case, you can inject files like sitecustomize.py or a PTH file, or even modify the code of the application.

  • (D) Restricted environment, read-only container image: similar to (C), but you can only change environment variables and the command line. That's all.

I think that use case (A) should be elaborated:

  • If CPU affinity is used on the Python process, the number of CPUs "available" to the process should be used instead of the total number of CPUs; otherwise the risk is increased latency and a higher risk of timeouts.
  • If Python is running in a Linux cgroup with CPU throttling ("CPU shares"), again, it should respect that, for the same reasons.

CPUs

A CPU core is a physical core, but these days it's more convenient to count "logical CPUs", because Hyper-Threading is really close to 2x faster when you run 2 threads per CPU core.

When you consider a virtual machine, we don't talk about physical CPUs anymore, but "virtual CPUs" aka "vCPUs".

It's also possible to add or remove CPUs at runtime. On Linux, a CPU can be "on" or "off".

Limit CPUs

A system administrator has different ways to limit CPU usage:

  • Tune the CPU itself: change the CPU frequency, disable Turbo Boost, set a performance profile, etc. It's also related to the Thermal Design Power (TDP), CPU P-states, CPU C-states and many other complicated things. IMO it's not really relevant for the current discussion, since these changes don't affect how many worker processes/threads should be spawned.

  • CPU affinity. For example, use taskset -c 0-1 python on Unix to restrict Python to CPU #0 and CPU #1 (two CPUs). On Windows, you can use start /affinity 0x01 python.exe to restrict Python to the first CPU.

  • Linux cgroups: there are tons of tools to put a process in a cgroup and tune its "CPU share".

Number of CPUs

Ok, now to come back to the number of CPUs, there are:

  • os.cpu_count(): platform-specific code to get the total number of logical CPUs of the system.
  • os.sched_getaffinity(0): get CPU affinity of the current process.
  • cgroups: honestly, I don't know what the most reliable way to get CPU shares is. But what I understood is that there is a ratio to convert a fraction of CPU time into a "number of CPUs", and this ratio is arbitrary. So this ratio should be a parameter; it cannot be read or computed.

@corona10 wrote that Java did its best to handle cgroups but eventually gave up on it, and instead provides a command line option so sysadmins can tune the Java service using their knowledge of the machine.

Read-only container

For use case (C) and (D), we should discuss Python versions. There are two cases:

  • Python <= 3.12: the os.cpu_count() and os.sched_getaffinity() functions are available, but there will be no future development.
  • Python 3.13 target: we can consider new features to better fit all use cases.

Let's take the example of a read-only container image.

  • With Python <= 3.12, you might find a way to create a variant of the image to inject a sitecustomize.py or PTH file, or ask the vendor to change their image. It's possible to override os.cpu_count(), read environment variables and use -X options (sys._xoptions).

  • With Python 3.13, we can add new command line options and environment variables, so the read-only container image doesn't have to be modified.

@vstinner
Member

For Python <= 3.12, it's possible to implement the feature without modifying Python, but just by injecting a sitecustomize.py or a PTH file. Well, any code loaded at Python startup.

import os, sys

def parse_cmdline():
    env_ncpu = None
    cmdline_ncpu = None

    env_opt = os.environ.get('PYTHONCPUCOUNT', None)
    if env_opt:
        try:
            ncpu = int(env_opt)
        except ValueError:
            print(f"WARNING: invalid PYTHONCPUCOUNT value: {env_opt!r}")
        else:
            env_ncpu = ncpu

    # The -X cpu_count option takes precedence over the env var (applied below).
    if 'cpu_count' in sys._xoptions:
        xopt = sys._xoptions['cpu_count']
        try:
            ncpu = int(xopt)
        except ValueError:
            print(f"WARNING: invalid -X cpu_count value: {xopt!r}")
        else:
            cmdline_ncpu = ncpu

    # The command line option takes precedence over the environment variable.
    ncpu = env_ncpu
    if cmdline_ncpu:
        ncpu = cmdline_ncpu
    if ncpu:
        # Override os.cpu_count()
        def cpu_count():
            return ncpu
        cpu_count.__doc__ = os.cpu_count.__doc__
        os.cpu_count = cpu_count

parse_cmdline()

Example:

# Default
$ PYTHONPATH=$PWD ./python -c 'import os; print(os.cpu_count())'
12

# -X option
$ PYTHONPATH=$PWD ./python -X cpu_count=40 -c 'import os; print(os.cpu_count())'
40

# env var
$ PYTHONCPUCOUNT=20 ./python -c 'import os; print(os.cpu_count())'
20

@vstinner
Member

I propose to:

  • Add os.process_cpu_count(): PR gh-109649: Add os.process_cpu_count() function #109907
  • Add PYTHONCPUCOUNT=value env var and -X cpu_count=value which overrides os.cpu_count() and os.process_cpu_count() with a number.
  • For use case (B) (display the number of system CPUs), I would like to support -X cpu_count=default (affects the current process), which overrides PYTHONCPUCOUNT (whose effect is wider).

Later, we would consider:

  • Add pid argument to os.process_cpu_count().
  • Parse cgroups in os.process_cpu_count(): issue #80235 (os.cpu_count() should return a count assigned to a container or that the process is restricted to). Only if it's a good idea :-)
  • Support -X cpu_count=process: os.cpu_count() becomes an alias of os.process_cpu_count() (sketched below). So legacy applications that use os.cpu_count() for use case (A) (deciding how many worker processes to run) and run in read-only containers with Python 3.13 can be run with -X cpu_count=process to use the correct number of CPUs. Obviously, -X cpu_count=number would remain available if the sysadmin has good knowledge of the system or if Python returns the wrong number of CPUs (especially in cgroups).
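
Expressed in Python, the proposed -X cpu_count=process mode would roughly amount to the following (a sketch of the proposed semantics only, not the actual implementation):

import os

# Legacy callers of os.cpu_count() transparently get the number of CPUs
# usable by the current process instead of the system total.
os.cpu_count = os.process_cpu_count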

@vstinner
Member

cc @gpshead @indygreg

@gpshead
Member

gpshead commented Sep 29, 2023

Technical nit and reason why we shouldn't try to solve the larger "what is a core" problem within the stdlib itself:

A CPU core is a physical core, but these days it's more convenient to count "logical CPUs", because Hyper-Threading is really close to 2x faster when you run 2 threads per CPU core.

This is not accurate. Hyperthreading is rarely, if ever, that meaningful for many workloads.

  • On a 20-core 40-thread Broadwell Xeon E5 v4 system, make -j20 is ~10% faster than make -j40. (HT fail) (my hw & sw)
  • On a 24-core 48-thread Zen3 Epyc VM, make -j24 is ~10% faster than make -j48. (HT fail) (cloud I don't control - I expected a small win here, but security mitigations could mean it blocks all HT benefits)
  • On a 4-core 8-thread Zen2 Ryzen VM, make -j4 is ~20% slower than make -j8. (HT win) (my hw & sw)

In my sample set: AMD Zens can have meaningful HT, at the moment, and eke out a <~25% gain by doing so. Modern Intels with HT are no different.

A workload with a wide mix of very different instruction usage, such as integer-heavy code (ex: our interpreter) physically scheduled alongside float/AVX-heavy code, might be able to get a little more HT parallelism - but that is not the common case, and arranging for that is non-trivial (the OS generally won't detect and arrange this for you).


All that HT-specific deep dive aside, what a "CPU" core is is changing. big.LITTLE designs are becoming mainstream, not just for mobile devices anymore. For example, Intel 12th gen and later can have P cores and E cores. So that 10-physical-core laptop chip may have 12-14 threads, with some threads being 50-100% higher performance than others. Mac M-series chips have a similar mix of performance and efficiency cores.

To the OS each one of those presents as a "core". But the total compute throughput from each varies a lot, as does the range of compute latencies possible on each.

(end of technical detail comment; leaving my thoughts on what we should do for a followup)

@gpshead
Member

gpshead commented Sep 29, 2023

I propose to:

  • Add os.process_cpu_count(): PR gh-109649: Add os.process_cpu_count() function #109907
  • Add PYTHONCPUCOUNT=value env var and -X cpu_count=value which overrides os.cpu_count() and os.process_cpu_count() with a number.
  • For use case (B) (display the number of system CPUs), I would like to support -X cpu_count=default (affects the current process), which overrides PYTHONCPUCOUNT (whose effect is wider).

I like these proposals; they are along the lines of what we can actually accomplish and they provide clear, concrete results. We're exposing raw OS information and allowing the user to override it when they deem that appropriate for practical purposes.

Answering deeper questions, such as things pertaining to efficiency vs performance, throughput, latency, Linux cgroup shares, other container configs, CPU hotplug, or what VM cores actually are vs what an underlying hypervisor may be configured to actually schedule "core" processes on, is IMNSHO better left to continually evolving PyPI libraries, because those are pretty fluid concepts.

For example, for practical reasons, software may choose to adopt higher level concept libraries such as https://pypi.org/project/threadpoolctl/ which @mdboom mentioned on Discord. Because application processes and libraries are often not isolated and work best if they coordinate which fractions of available compute resources they each use with the other cross-language libraries and processes all being used at once for their common goal. A raw numbers of available logical cores alone can't accomplish that.
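
For instance, threadpoolctl exposes a context manager that caps the threads used by native BLAS/OpenMP pools, the kind of cross-library coordination described above (a minimal sketch using its documented API; numpy is only there to give the pools something to do):

from threadpoolctl import threadpool_limits
import numpy as np

# Limit native thread pools (OpenBLAS, MKL, OpenMP, ...) to 4 threads
# inside this block, regardless of what os.cpu_count() reports.
with threadpool_limits(limits=4):
    a = np.random.rand(2000, 2000)
    b = a @ a  # this BLAS matmul uses at most 4 threads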

I like most of your "Later, we would consider" proposals (as followup work) - except for the cgroups one: Per the above, I view that as too ill-defined of a concept for us to dictate any mapping of that to "cores" as a stdlib API.

@vstinner
Member

I like most of your "Later, we would consider" proposals (as followup work) - except for the cgroups one: Per the above, I view that as too ill-defined of a concept for us to dictate any mapping of that to "cores" as a stdlib API.

I have mixed feelings about reading cgroups in the Python stdlib. What I care about the most here is to make sure that, with the proposed design, we can still change our mind tomorrow and read cgroups.

I'm saying that because my first proposition was to add an affinity parameter to os.cpu_count(). It's simple. It's easy. Why not? Well, if tomorrow we read cgroups, what does that mean? We have to add a second cgroups parameter? And then what? The correct code would look like:

kwargs = {}
if sys.version_info >= (3, 13):
    kwargs['affinity'] = True
if sys.version_info >= (3, 14):
    kwargs['cgroups'] = True
cpu_count = os.cpu_count(**kwargs)

Huh! That's not convenient.

I prefer a separate function which has no parameters: process_cpu_count(). It takes CPU affinity into account where available. If tomorrow we consider reading cgroups, again, it would still fit into "return the number of CPUs usable by the current process".
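
For reference, the difference between the two functions is easy to observe; for example, on a hypothetical 12-CPU Linux machine with Python 3.13, restricting affinity with taskset:

$ taskset -c 0-1 python3 -c 'import os; print(os.cpu_count(), os.process_cpu_count())'
12 2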

@corona10
Member Author

Yeah, I like the proposal in #109595 (comment); it will make the situation better for Python users in exceptional environments :)

corona10 added a commit to corona10/cpython that referenced this issue Sep 30, 2023
corona10 added a commit to corona10/cpython that referenced this issue Oct 3, 2023
corona10 added a commit to corona10/cpython that referenced this issue Oct 7, 2023
corona10 added a commit that referenced this issue Oct 10, 2023
@corona10
Member Author

FYI, I am preparing the -Xcpu_count=process option in a separate PR.

@vstinner
Member

Would you mind creating a separate PR for -Xcpu_count=process?

@vstinner
Member

Follow-up issue: #110649: Add -Xcpu_count=process cmdline mode to redirect os.cpu_count to os.process_cpu_count.
