BLAS memory allocation error in Scikit-learn KMeans & kNN & DBSCAN #3321

Closed
OnlyDeniko opened this issue Jul 20, 2021 · 49 comments · Fixed by #3352

@OnlyDeniko

OnlyDeniko commented Jul 20, 2021

scikit-learn/scikit-learn#20539

Do you have any ideas?

@martin-frbg
Collaborator

martin-frbg commented Jul 20, 2021

The recommendation given there was on the right track; I wonder why it did not work. Multithreaded OpenBLAS requires a memory buffer per thread, and the maximum number of buffers is set at compile time. So there is an (ideally/normally) invisible limitation caused by whatever the OpenBLAS that came with either numpy or your operating system was configured for. Does it work when you set OPENBLAS_NUM_THREADS to a smaller value, like 16 or 32? (The OpenBLAS that comes with numpy 1.21 is built for 64 threads, as recently established in #3318 (comment), but maybe you have some other version imported elsewhere in your combination of programs.)
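
For reference, one way to try this from Python is to set the variable before numpy (and therefore OpenBLAS) is loaded — a minimal sketch, with the reproducer itself left out:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "16"  # must be set before numpy loads OpenBLAS

import numpy as np  # OpenBLAS now starts with at most 16 threads
# ... run the scikit-learn reproducer here ...

Setting OPENBLAS_NUM_THREADS=16 in the shell before launching Python has the same effect.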

@OnlyDeniko
Author

I installed numpy: pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy
Reproducer works only with OPENBLAS_NUM_THREADS <= 17

@martin-frbg
Collaborator

Sounds like that package was built with NUM_THREADS=16 although the conda-forge "recipe" for building openblas packages sets this number to 128. Not sure about the who/how/where for the packages though.

@OnlyDeniko
Author

OnlyDeniko commented Jul 20, 2021

Just to clarify, I install scikit-learn from the pip channel. NumPy is installed as a dependency and comes bundled with its own OpenBLAS.
I found this .so file in the site-packages/numpy.libs folder, maybe it can help you: libopenblas64_p-r0-6d9684d7.3.17.so

@brada4
Contributor

brada4 commented Jul 20, 2021

These nightly builds are unstable and are only available as pip packages on PyPI.

Please install the official stable versions with conda and re-check. There is not even a record of the release tag 6d9684d7 in the search.

@OnlyDeniko
Author

I have no errors when I install numpy from the conda channels (main, conda-forge, intel), because there OpenBLAS is installed separately.
The error appears when I install numpy from pip. So I installed the latest stable version of numpy, 1.21.1, from the pip channel, and in site-packages/numpy.libs I have libopenblasp-r0-2d23e62b.3.17.so

@brada4
Contributor

brada4 commented Jul 21, 2021

So, is the problem gone?
We cannot match those hashes to anything on the real PyPI, the Anaconda PyPI index, or conda-forge. What is certain is that we did not make the nightly build you downloaded from the Anaconda PyPI index.

@OnlyDeniko
Author

No, the problem remains. I do not download any nightly builds, I just type this: pip install numpy

@brada4
Contributor

brada4 commented Jul 21, 2021

Please check the configuration of the .so files you got:

ctypes.CDLL("/path/to/your/lib/openblas.so").openblas_get_config()

Also, what does nproc report on your machine? It seems that was not answered in the other thread.
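
(Note that ctypes assumes an int return type by default, so to see the full configuration string the return type has to be declared first — a small sketch, with the library path as a placeholder:)

import ctypes

lib = ctypes.CDLL("/path/to/your/lib/openblas.so")   # adjust to the actual .so path
lib.openblas_get_config.restype = ctypes.c_char_p    # the function returns a C string
print(lib.openblas_get_config().decode())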

@martin-frbg
Collaborator

If you do not want to limit the number of OpenBLAS threads, the only solution would seem to be to build OpenBLAS from source with a high enough NUM_THREADS. I cannot match the hash in the library names you mentioned to a specific build (and consequently its build options), but I would expect the maximum supported thread count to be at least 64. Maybe what is happening is that parts of Scikit-learn are themselves making parallel calls into OpenBLAS; limiting the thread count may even provide a performance benefit in that situation.
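
If you want to experiment with that without rebuilding anything, threadpoolctl (already a scikit-learn dependency) can cap the OpenBLAS thread count at runtime — a rough sketch, with the matrix size chosen only for illustration:

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)
with threadpool_limits(limits=16, user_api="blas"):  # cap OpenBLAS at 16 threads for this block
    a @ a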

@OnlyDeniko
Author

Please check the configuration of the .so files you got:

ctypes.CDLL("/path/to/your/lib/openblas.so").openblas_get_config()

Also, what does nproc report on your machine? It seems that was not answered in the other thread.

nproc=96

>>> import ctypes
>>> ctypes.CDLL("/home/ubuntu/miniconda3/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so").openblas_get_config()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libgfortran-2e0d59d6.so.5.0.0: cannot open shared object file: No such file or directory
$ ldd libopenblasp-r0-2d23e62b.3.17.so
        linux-vdso.so.1 (0x00007ffcdf182000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1f42a1b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1f427fc000)
        libgfortran-2e0d59d6.so.5.0.0 => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1f4240b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1f44d16000)

@isuruf
Contributor

isuruf commented Jul 22, 2021

There are multiple installations you are talking about here.

libopenblas64_p-r0-6d9684d7.3.17.so

This is a nightly build with INTERFACE64=1

libopenblasp-r0-2d23e62b.3.17.so

I'm not sure where you got this from. It doesn't seem to be from PyPI

Can you remove numpy and make sure that there's nothing in /home/ubuntu/miniconda3/lib/python3.8/site-packages/numpy.libs/
and then install using pip install numpy? Send the output of pip install numpy.

@OnlyDeniko
Author

As you recommended, I installed numpy using: pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy
And I got libopenblas64_p-r0-6d9684d7.3.17.so in site-packages/numpy.libs. But it does not help, so let's forget about it and talk only about pip install numpy.
You can try the download yourself and you will get the following result:

$ pip install numpy
Collecting numpy
  Using cached numpy-1.21.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.21.1

@brada4
Contributor

brada4 commented Jul 22, 2021

The build command for OpenBLAS is here, i.e. DYNAMIC_ARCH and 128 threads:
https://github.com/conda-forge/openblas-feedstock/blob/25dab765a98489e0a3e2ca8c3e7094e21e471425/recipe/build.sh#L68

@OnlyDeniko
Author

The build command for OpenBLAS is here, i.e. DYNAMIC_ARCH and 128 threads:
https://github.com/conda-forge/openblas-feedstock/blob/25dab765a98489e0a3e2ca8c3e7094e21e471425/recipe/build.sh#L68

I'm not going to build OpenBLAS manually. As a user, I want to be sure that a perfectly ordinary install of numpy will start and work on any machine. When installing from the conda channels this is the case, but the problem is with the pip channel. Have you been able to reproduce the problem yourself?

@isuruf
Contributor

isuruf commented Jul 22, 2021

Can you run the script with OPENBLAS_VERBOSE=2?

@OnlyDeniko
Author

Can you run the script with OPENBLAS_VERBOSE=2?

Core: SkylakeX
OpenBLAS : Your OS does not support AVX512VL instructions. OpenBLAS is using Haswell kernels as a fallback, which may give poorer performance.
Core: Haswell
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault (core dumped)

@martin-frbg
Collaborator

Is there any control over the parallelism in scikit? Assuming you had 8 threads running in parallel, each calling an OpenBLAS function that uses 16 threads, there would be no free slots in the buffer list for a 9th thread doing the same.

@OnlyDeniko
Author

Is there any control over the parallelism in scikit? Assuming you had 8 threads running in parallel, each calling an OpenBLAS function that uses 16 threads, there would be no free slots in the buffer list for a 9th thread doing the same.

Actually, I do not know. Gotta ask the guys from the scikit-learn team

@brada4
Contributor

brada4 commented Jul 22, 2021

This parallelism is used.
https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources

Please set OPENBLAS_NUM_THREADS=1

@jeremiedbb

In KMeans we call OpenBLAS gemm inside a parallel (openmp) loop, but we set openblas num threads to one to avoid nesting parallelism.

In KNN and DBSCAN we call OpenBLAS in a multi-process setup and as mentioned above we set the number of openblas threads such that the total number of threads does not exceed the number of cpus.

Setting OPENBLAS_NUM_THREADS=1 means all OpenBLAS calls will be sequential even in non-nested regions, which is unfortunately not optimal.

@OnlyDeniko do you set the n_jobs parameter for these estimators ?

@OnlyDeniko
Author

@jeremiedbb I set n_jobs=-1

@jeremiedbb

Could you try setting it to a lower value like 16, 32 or 64 and see when it breaks?

@brada4
Contributor

brada4 commented Jul 22, 2021

@jeremiedbb it is MAX(50, NUM_CPU*2) memory regions.
That is how many temporary allocations can be in use, at most one per call, possibly one more if calls are nested.
Namely, it breaks when your_threads × openblas_threads exceeds 256, and it gets slow once you exceed the real core count anyway.
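Plugging in the numbers from this thread, and assuming that formula is the one in effect: a conda-forge build with NUM_THREADS=128 gives MAX(50, 128*2) = 256 regions, which is where the 256 above comes from, while a wheel built for 64 threads would correspondingly top out at MAX(50, 64*2) = 128.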

@OnlyDeniko
Author

Could you try setting it to a lower value like 16, 32 or 64 and see when it breaks?

KMeans works with n_jobs <= 65.
KNN does not work with n_jobs=-1, but works with n_jobs=None.

@brada4
Contributor

brada4 commented Jul 22, 2021

It is a timing race until you reach 256 allocations. Oversubscription damages performance worse than linearly. For the fastest result you somehow need to arrange for one OpenBLAS thread per CPU, say n_jobs=48 with OPENBLAS_NUM_THREADS=2, or some other combination that multiplies out to the 96 cores and returns the result in the shortest time.
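
A combination like that could be set up as follows (a sketch only; the right split depends on the workload):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # set before numpy loads OpenBLAS

# then run the estimator with n_jobs=48, so that 48 jobs x 2 BLAS threads = 96 CPUs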

@jeremiedbb

For KMeans, we deal with the number of OpenBLAS threads internally, so setting OMP_NUM_THREADS=64 or n_jobs=64 should be enough.

For KNN, I'd suggest setting n_jobs=1 and maybe OPENBLAS_NUM_THREADS=64, since I don't think multiprocessing brings anything for this estimator. We are currently reworking it to have much better scalability in multicore settings, but it's still WIP.
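
Concretely, the settings suggested above might look something like this (a sketch; the data and estimator parameters are placeholders, not a benchmark):

import os
os.environ["OMP_NUM_THREADS"] = "64"       # caps the OpenMP threads used by KMeans
os.environ["OPENBLAS_NUM_THREADS"] = "64"  # lets OpenBLAS itself parallelize for KNN

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(10_000, 50)
y = np.random.randint(2, size=10_000)

KMeans(n_clusters=10).fit(X)  # internal OpenMP parallelism; scikit-learn keeps BLAS sequential inside it
KNeighborsClassifier(n_jobs=1).fit(X, y).predict(X[:100])  # single job, threaded BLAS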

@isuruf
Contributor

isuruf commented Jul 22, 2021

Since it looks like this is just a matter of increasing the parameter at build time from 64 to 128, can you open an issue in https://github.com/MacPython/openblas-libs/ ?

@brada4
Contributor

brada4 commented Jul 23, 2021

This one was the official release with the OpenBLAS binary pulled from conda: 128 threads and wild, well-documented CPU oversubscription.
I think this issue can go back to scikit-learn. The ctypes trick can set the thread count at runtime.

@OnlyDeniko
Author

python -m threadpoolctl --import sklearn
[
  {
    "filepath": "/home/ubuntu/miniconda3/envs/dkulandi_bench/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 96
  },
  {
    "filepath": "/home/ubuntu/miniconda3/envs/dkulandi_bench/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.17",
    "num_threads": 64,
    "threading_layer": "pthreads",
    "architecture": "SkylakeX"
  },
  {
    "filepath": "/home/ubuntu/miniconda3/envs/dkulandi_bench/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 64,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  }
]
python -c "import joblib; print(joblib.cpu_count(only_physical_cores=True))"
48
python -c "import joblib; print(joblib.cpu_count())"
96
lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:            7
CPU MHz:             1201.212
BogoMIPS:            5999.97
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

@ogrisel
Contributor

ogrisel commented Jul 28, 2021

Thanks for the details.

So we get the confirmation that your code relies on the OpenBLAS shipped in the numpy and scipy wheels and each wheel brings a different version. Usually this is not a problem.

But I am still not sure why this crashes in scikit-learn:

  • when calling KMeans with n_jobs=-1, this is equivalent to calling KMeans with n_jobs=96 on your machine and creates 96 OpenMP threads, each of which should call OpenBLAS in sequential mode thanks to this line. joblib is not used at all in this case, so I don't understand why we have the problem (see the edit below).

  • for k-NN and DBSCAN this is another story: in this case n_jobs=-1 is resolved to n_jobs=96 and creates 96 independent Python processes via joblib. Each of them is initialized with the OPENBLAS_NUM_THREADS=max(1, CPU_COUNT/n_jobs)=1 environment variable. So we should not have a problem either.

In either case, we should not get oversubscription-related performance problems either: OpenBLAS should always run in sequential mode in the end.

According to:

https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md#program-is-terminated-because-you-tried-to-allocate-too-many-memory-regions

The error you observe could still be resolved by increasing the NUM_THREADS make variable at build time, as conda-forge does.

@OnlyDeniko do you confirm that you do not reproduce the problem if you install everything from conda-forge which sets NUM_THREADS=128 at build time? You can create a dedicated env with:

conda create -n sklearn-cf -c conda-forge scikit-learn
conda activate sklearn-cf
python -m threadpoolctl --import sklearn  # just to check
python your_reproducer_script.py

Python, numpy, scipy, openblas, joblib and threadpoolctl are all dependencies of scikit-learn so conda will install them all from conda-forge automatically.

Edit: from the reference linked above:

Despite its name, and due to the use of memory buffers in functions like SGEMM, the setting of NUM_THREADS can be relevant even for a single-threaded build of OpenBLAS, if such functions get called by multiple threads of a program that uses OpenBLAS. In some cases, the affected code may simply crash or throw a segmentation fault without displaying the above warning first.

So indeed for KMeans, even though OpenBLAS is called with 1 thread at runtime, it is called by 96 OpenMP threads, so this might be the problem.

@jeremiedbb

But I am still not sure why this crashes in scikit-learn:
when calling KMeans with n_jobs=-1, this is equivalent to calling KMeans with n_jobs=96 on your machine and creates 96 OpenMP threads, each of which should call OpenBLAS in sequential mode thanks to this line. joblib is not used at all in this case, so I don't understand why we have the problem.
for k-NN and DBSCAN this is another story: in this case n_jobs=-1 is resolved to n_jobs=96 and creates 96 independent Python processes via joblib. Each of them is initialized with the OPENBLAS_NUM_THREADS=max(1, CPU_COUNT/n_jobs)=1 environment variable. So we should not have a problem either.

In both cases we try to create 96 memory regions, which is more than 64. For KMeans, setting OMP_NUM_THREADS=64 or n_jobs=64 should be ok. For KNN, setting n_jobs=64 and OPENBLAS_NUM_THREADS=1 should be ok (alternatively n_jobs=1 and OPENBLAS_NUM_THREADS=64).

@ogrisel
Contributor

ogrisel commented Jul 28, 2021

I don't understand why this breaks when we use joblib sub-processes for KNN: each worker process manages its memory independently of the others. There should be no shared buffers.

@jeremiedbb

jeremiedbb commented Jul 28, 2021

In KNN, parallelism (assuming brute force) comes from the pairwise distance computations, which use joblib with the threading backend.
Edit: we also use the threading backend for the tree-based solvers.
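
So the failing pattern is essentially many Python threads entering OpenBLAS concurrently. A stripped-down illustration of that pattern (not scikit-learn's actual code, and the sizes are arbitrary):

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(256, 256)

def chunk(_):
    return X @ X.T  # each concurrent call enters OpenBLAS gemm and needs buffer slots

# with more concurrent callers than the compile-time buffer pool allows,
# this is the situation that produces "too many memory regions"
Parallel(n_jobs=96, backend="threading")(delayed(chunk)(i) for i in range(96))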

@ogrisel
Contributor

ogrisel commented Jul 28, 2021

Alright that makes sense then. And DBSCAN does the same to precompute the neighborhood graph.

@OnlyDeniko
Author

@OnlyDeniko do you confirm that you do not reproduce the problem if you install everything from conda-forge which sets NUM_THREADS=128 at build time?

Yes, I confirm

[
  {
    "filepath": "/home/ubuntu/miniconda3/envs/sklearn-cf/lib/libopenblasp-r0.3.17.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.17",
    "num_threads": 96,
    "threading_layer": "pthreads",
    "architecture": "SkylakeX"
  },
  {
    "filepath": "/home/ubuntu/miniconda3/envs/sklearn-cf/lib/libgomp.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 96
  }
]

@ogrisel
Contributor

ogrisel commented Jul 28, 2021

I think we understand the root cause and the solution of the problem now (and workarounds). I think we can close the issue on this repo in favor of MacPython/openblas-libs#64 which I just created.

@martin-frbg
Collaborator

@ogrisel thank you very much for looking into this. This buffer scheme remains the major design flaw in OpenBLAS, but I suspect the only thing I can do short-term to mitigate its effect is to add more information to the error message, in particular the number of threads the library was built for.

@ogrisel
Contributor

ogrisel commented Jul 28, 2021

I suspect the only thing I can do short-term to mitigate its effect is to add more information to the error message, in particular the number of threads the library was built for.

That would be great. You could also link to a dedicated markdown document on GitHub that gives users more details on how to introspect how many CPUs they have on their machine and where their OpenBLAS was installed from (I am pretty sure that most OpenBLAS users do not know that they use OpenBLAS, because they use it via numpy, scipy, pytorch, R or something similar).

@ogrisel
Contributor

ogrisel commented Aug 19, 2021

Actually, this is not the only problem: using more than 64 threads seems to degrade the performance of a 4096x4096 DGEMM, see MacPython/openblas-libs#64 (comment).

@martin-frbg
Collaborator

It is certainly possible to throw so many threads at a "small" problem that performance degrades again, but I believe it would need an unmanageable (and itself costly) set of rules to tailor the number of threads to each problem size, where OpenBLAS currently switches between 1 and all threads only. Hardware layout (cache locality, multi-die cpu interconnects etc) will also play a role.

@ogrisel
Contributor

ogrisel commented Aug 20, 2021

I am not sure how to move forward with this. Increasing NUM_THREADS in the default builds of OpenBLAS used by the majority of the Python ecosystem is not necessarily a good idea because it can cause performance degradations of typical numpy/scipy workloads when running on machines with hundreds of cores.

Implementing an ad-hoc mitigation in scikit-learn for estimators that call OpenBLAS routines in sequential mode from a large number of externally managed threads is possible but complex, hard to maintain and brittle. See MacPython/openblas-libs#64 (comment) for a minimal reproducer and some details on how to technically implement this. But such an ad-hoc mitigation would not solve the problem for other libraries (apparently it might impact PyTorch users as well).

Ideally the problem should be solved in OpenBLAS by making it possible to allocate extra buffers when needed when OpenBLAS is called by a large number of externally managed threads.
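
For the record, the general shape of such a mitigation, using threadpoolctl to keep the number of concurrent OpenBLAS callers below the buffer limit, could look like the following sketch (purely illustrative, not scikit-learn's implementation; the limit of 64 is an assumption based on the wheel discussed in this issue):

import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans

X = np.random.rand(10_000, 50)

# cap the OpenMP threads that end up calling into OpenBLAS so that the number of
# concurrent callers stays below the buffer pool the wheel was built with
with threadpool_limits(limits=64, user_api="openmp"):
    KMeans(n_clusters=8, n_init=1).fit(X)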

@mattip
Contributor

mattip commented Aug 20, 2021

It would be nice if there were a mechanism to report the error and return a sentinel or set some errno without crashing the process.

@martin-frbg
Collaborator

You are not the first to come up with that suggestion. Unfortunately, when we reach this situation there is nowhere left to go, and there is no universally agreed error code or mechanism to return "BLAS just died on you" anyway.

@martin-frbg
Collaborator

martin-frbg commented Aug 26, 2021

@ogrisel @mattip can you give #3350 #3352 a spin please? (This should malloc space for another 512 threads in an emergency - anything but elegant, but probably better than giving up.) Unfortunately our drone CI has stopped running and our travis is not yet up again after the migration, leaving me without serious multicore hardware. (The PR was tested on a 12C system with OpenBLAS intentionally crippled to support only 4 threads though.)

@linuxl7

linuxl7 commented Apr 28, 2023

Changing \site-packages\joblib\externals\loky\backend\context.py can do it:

os_cpu_count = min(os.cpu_count() or 1, 12)  # cap the CPU count loky detects at 12

cpu_count_user = min(_cpu_count_user(os_cpu_count), 12)  # cap the user-requested count as well
