
PyPy tests timeouts / memory usage investigation #27662

Closed
lesteve opened this issue Oct 25, 2023 · 25 comments · Fixed by #29128


lesteve (Member) commented Oct 25, 2023

EDIT: one of the main causes of the problem described below has already been fixed by #27670. However, despite this improvement, significant memory problems remain when running the scikit-learn test suite on PyPy. Similar investigation and fixes are needed to iteratively tackle the next worst offenders until the tests can run with an amount of memory comparable to what we observe with CPython (instead of roughly a factor of 10 more).

Original description:

I had a closer look at the PyPy tests which have been timing out for a while, here is the result of my investigation. This may also help in the future to have a central issue for this rather than the discussion being split in different automatically created issues in this repo and scikit-learn-feedstock.

The PyPy tests need ~11GB locally on my machine, whereas CPython needs about 1.2GB. I ran them without pytest-xdist to simplify things.

[memory usage over time plots: PyPy vs CPython]

It seems like one of the places where the memory usage grows with time is the linear_model tests (3.4GB with PyPy vs 200MB with CPython locally).

[linear_model memory usage plots: PyPy vs CPython]

I managed to reproduce the issue (memory growing way more than on CPython) with the following snippet, where one of our Cython loss functions is called many times in a tight loop:

import psutil
import gc
from functools import partial
import platform

import numpy as np

from sklearn._loss.loss import HalfGammaLoss

IS_PYPY = platform.python_implementation() == "PyPy"

def func(data):
    loss = HalfGammaLoss()
    for i in range(10_000):
        loss(data, data)


def main():
    for i in range(101):
        if i % 10 == 0:
            memory_usage = psutil.Process().memory_info().rss / 1e9
            message = f'{i} psutil: {memory_usage:.3f}GB'
            if IS_PYPY:
                pypy_memory_usage = gc.get_stats().memory_allocated_sum
                message += f' , pypy total allocated: {pypy_memory_usage}'
            print(message)

        n_samples, n_features = 4, 12
        rng = np.random.RandomState(0)
        raw_prediction = rng.uniform(low=-3, high=3, size=n_samples)
        func(raw_prediction)

    if IS_PYPY:
        print(gc.get_stats())

if __name__ == '__main__':
    main()

Output shows that the memory usage grows with time and that it is about 10 times the total memory allocated reported by PyPy:

0 psutil: 0.175GB , pypy total allocated: 69.2MB
10 psutil: 0.537GB , pypy total allocated: 113.9MB
20 psutil: 0.888GB , pypy total allocated: 157.4MB
30 psutil: 1.257GB , pypy total allocated: 201.4MB
40 psutil: 1.595GB , pypy total allocated: 247.4MB
50 psutil: 1.931GB , pypy total allocated: 291.9MB
60 psutil: 2.325GB , pypy total allocated: 330.4MB
70 psutil: 2.659GB , pypy total allocated: 374.4MB
80 psutil: 3.003GB , pypy total allocated: 423.9MB
90 psutil: 3.327GB , pypy total allocated: 460.9MB
100 psutil: 3.663GB , pypy total allocated: 508.4MB

Not entirely sure whether this is a red herring, an issue with our Cython loss implementation, or an issue at the interaction between Cython and PyPy. Maybe @mattip has some insights about this?

On CPython the memory usage is stable around 100MB:

0 psutil: 0.097GB
10 psutil: 0.098GB
20 psutil: 0.098GB
30 psutil: 0.098GB
40 psutil: 0.098GB
50 psutil: 0.098GB
60 psutil: 0.098GB
70 psutil: 0.098GB
80 psutil: 0.098GB
90 psutil: 0.098GB
100 psutil: 0.098GB
My environment, for good measure:

 mamba list               
# packages in environment at /home/lesteve/micromamba/envs/pypy:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2023.7.22            hbcca054_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
cython                    3.0.3            py39hc10206b_0    conda-forge
exceptiongroup            1.1.3              pyhd8ed1ab_0    conda-forge
execnet                   2.0.2              pyhd8ed1ab_0    conda-forge
expat                     2.5.0                hcb278e6_1    conda-forge
gdbm                      1.18                 h0a1914f_2    conda-forge
iniconfig                 2.0.0              pyhd8ed1ab_0    conda-forge
joblib                    1.3.2              pyhd8ed1ab_0    conda-forge
libblas                   3.9.0           19_linux64_openblas    conda-forge
libcblas                  3.9.0           19_linux64_openblas    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_2    conda-forge
libgfortran-ng            13.2.0               h69a702a_2    conda-forge
libgfortran5              13.2.0               ha4646dd_2    conda-forge
libgomp                   13.2.0               h807b86a_2    conda-forge
liblapack                 3.9.0           19_linux64_openblas    conda-forge
libopenblas               0.3.24          pthreads_h413a1c8_0    conda-forge
libsqlite                 3.43.2               h2797004_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_2    conda-forge
libxcb                    1.15                 h0b41bf4_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
numpy                     1.26.0           py39h6dedee3_0    conda-forge
openssl                   3.1.4                hd590300_0    conda-forge
packaging                 23.2               pyhd8ed1ab_0    conda-forge
pip                       23.3               pyhd8ed1ab_0    conda-forge
pluggy                    1.3.0              pyhd8ed1ab_0    conda-forge
psutil                    5.9.5            py39hf860d4a_1    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
pypy                      7.3.13                 0_pypy39    conda-forge
pypy3.9                   7.3.13               h9557127_0    conda-forge
pytest                    7.4.2              pyhd8ed1ab_0    conda-forge
pytest-repeat             0.9.2              pyhd8ed1ab_0    conda-forge
pytest-xdist              3.3.1              pyhd8ed1ab_0    conda-forge
python                    3.9.18                0_73_pypy    conda-forge
python_abi                3.9               4_pypy39_pp73    conda-forge
readline                  8.2                  h8228510_1    conda-forge
scipy                     1.11.3           py39h6dedee3_1    conda-forge
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
sqlite                    3.43.2               h2c6b66d_0    conda-forge
threadpoolctl             3.2.0              pyha21a80b_0    conda-forge
tk                        8.6.13               h2797004_0    conda-forge
tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.41.2             pyhd8ed1ab_0    conda-forge
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libx11               1.8.7                h8ee46fc_0    conda-forge
xorg-libxau               1.0.11               hd590300_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-xextproto            7.3.0             h0b41bf4_1003    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge

@github-actions github-actions bot added the Needs Triage Issue requires triage label Oct 25, 2023
mattip commented Oct 25, 2023

Thanks for looking into this.

The PyPy garbage collector kicks into action when it sees there is too much memory allocated (the "pypy total allocated" in your analysis). It "knows" about memory allocated for PyObjects in the C-API, and about Python objects allocated by the interpreter. But it cannot know about memory allocated directly in C by malloc and friends (calloc, realloc, ...). There is a way to tell the PyPy GC about such allocations using __pypy__.add_memory_pressure. Is there much direct C allocation in scikit-learn (this includes use of PyMem_RawMalloc and friends)?
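A minimal sketch of the add_memory_pressure hint mattip mentions; the wrapper name and the 50MB figure are illustrative and not scikit-learn code, and the call is a no-op on CPython:

```python
import platform


def note_external_allocation(nbytes):
    """Tell PyPy's GC about memory allocated outside its view.

    Hypothetical helper: after a C extension raw-mallocs `nbytes`,
    this hint lets PyPy trigger a collection sooner.  On CPython it
    does nothing, since refcounting already frees such memory promptly.
    """
    if platform.python_implementation() == "PyPy":
        import __pypy__
        __pypy__.add_memory_pressure(nbytes)


# e.g. after allocating a 50MB raw buffer in a C extension:
note_external_allocation(50 * 1024 * 1024)
```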

Another possible cause for lack of collection is that PyPy does not break C-API memory cycles. There is some discussion of the problem in an open PyPy issue. TL;DR: we hope adoption of HPy will eliminate the dual representation of objects in C and in RPython that makes this problem very difficult. However, I would expect such cycles to at least show up in the pypy total allocated statistic.

glemaitre (Member)

I checked the Cython code for the loss used in the snippet above. I think there are two relevant pieces of info. The allocation is done in the Python part:

if loss_out is None:
    loss_out = np.empty_like(y_true)
# Be graceful to shape (n_samples, 1) -> (n_samples,)
if raw_prediction.ndim == 2 and raw_prediction.shape[1] == 1:
    raw_prediction = raw_prediction.squeeze(1)
return self.closs.loss(
    y_true=y_true,
    raw_prediction=raw_prediction,
    sample_weight=sample_weight,
    loss_out=loss_out,
    n_threads=n_threads,
)

So we have a np.empty there and then we call the Cython code.

def loss(
    self,
    const floating_in[::1] y_true,          # IN
    const floating_in[::1] raw_prediction,  # IN
    const floating_in[::1] sample_weight,   # IN
    floating_out[::1] loss_out,             # OUT
    int n_threads=1
):
    cdef:
        int i
        int n_samples = y_true.shape[0]
    if sample_weight is None:
        for i in prange(
            n_samples, schedule='static', nogil=True, num_threads=n_threads
        ):
            loss_out[i] = {{closs}}(y_true[i], raw_prediction[i]{{with_param}})
    else:
        for i in prange(
            n_samples, schedule='static', nogil=True, num_threads=n_threads
        ):
            loss_out[i] = sample_weight[i] * {{closs}}(y_true[i], raw_prediction[i]{{with_param}})
    return np.asarray(loss_out)

In this part, the only thing that could be weird is the call to np.asarray. I don't know if there is an interaction between Cython, the garbage collector, and the memoryview.

@lesteve would you mind trying the following patch, which avoids the call to np.asarray and just writes directly into the memoryview? Unfortunately, I cannot quickly install PyPy on my machine.

diff --git a/sklearn/_loss/_loss.pyx.tp b/sklearn/_loss/_loss.pyx.tp
index 8efeeea77f..915380d260 100644
--- a/sklearn/_loss/_loss.pyx.tp
+++ b/sklearn/_loss/_loss.pyx.tp
@@ -1045,7 +1045,7 @@ cdef class {{name}}(CyLossFunction):
             ):
                 loss_out[i] = sample_weight[i] * {{closs}}(y_true[i], raw_prediction[i]{{with_param}})
 
-        return np.asarray(loss_out)
+        # return np.asarray(loss_out)
 
     {{if closs_grad is not None}}
     def loss_gradient(
diff --git a/sklearn/_loss/loss.py b/sklearn/_loss/loss.py
index 11cb0e42c4..bd3d0df884 100644
--- a/sklearn/_loss/loss.py
+++ b/sklearn/_loss/loss.py
@@ -189,13 +189,14 @@ class BaseLoss:
         if raw_prediction.ndim == 2 and raw_prediction.shape[1] == 1:
             raw_prediction = raw_prediction.squeeze(1)
 
-        return self.closs.loss(
+        self.closs.loss(
             y_true=y_true,
             raw_prediction=raw_prediction,
             sample_weight=sample_weight,
             loss_out=loss_out,
             n_threads=n_threads,
         )
+        return loss_out
 
     def loss_gradient(
         self,

glemaitre (Member)

I found my Linux machine and could try @lesteve's script with the above changes. Interestingly, it removes the memory leak:

0 psutil: 0.199GB , pypy total allocated: 97.0MB
10 psutil: 0.227GB , pypy total allocated: 98.2MB
20 psutil: 0.227GB , pypy total allocated: 98.2MB
30 psutil: 0.227GB , pypy total allocated: 98.2MB
40 psutil: 0.227GB , pypy total allocated: 98.2MB
50 psutil: 0.227GB , pypy total allocated: 98.2MB
60 psutil: 0.227GB , pypy total allocated: 98.2MB
70 psutil: 0.227GB , pypy total allocated: 98.2MB
80 psutil: 0.227GB , pypy total allocated: 98.2MB
90 psutil: 0.227GB , pypy total allocated: 98.2MB
100 psutil: 0.227GB , pypy total allocated: 98.2MB
Total memory consumed:
    GC used:            85.2MB (peak: 97.2MB)
       in arenas:            62.0MB
       rawmalloced:          20.2MB
       nursery:              3.0MB
    raw assembler used: 664.6kB
    -----------------------------
    Total:              85.8MB

    Total memory allocated:
    GC allocated:            97.2MB (peak: 102.5MB)
       in arenas:            68.7MB
       rawmalloced:          25.5MB
       nursery:              3.0MB
    raw assembler allocated: 1.0MB
    -----------------------------
    Total:                   98.2MB

    Total time spent in GC:  11.162

So we should probably look at the np.asarray calls that we use to convert memoryviews back to arrays, since there is something going on there.

ogrisel (Member) commented Oct 25, 2023

We do not need those np.asarray calls when we pass pre-allocated out arguments to Cython functions. We can just remove them all. /cc @lorentzenchr.
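A pure-Python sketch of the pattern ogrisel describes; the function names and the subtraction are placeholders, not scikit-learn's actual loss formula. The kernel writes into a caller-provided buffer and returns nothing, and the Python wrapper returns the very array it allocated instead of `np.asarray(memoryview)`:

```python
import numpy as np


def _fill_loss(y_true, raw_prediction, loss_out):
    # Stand-in for the Cython prange kernel: it writes into the
    # caller's buffer and returns nothing, so no extra array object
    # ever wraps the data.
    np.subtract(raw_prediction, y_true, out=loss_out)


def loss(y_true, raw_prediction, loss_out=None):
    # The Python wrapper owns the allocation and returns the exact
    # array it allocated, avoiding the array -> memoryview -> array
    # chain that np.asarray(loss_out) would create in Cython.
    if loss_out is None:
        loss_out = np.empty_like(y_true)
    _fill_loss(y_true, raw_prediction, loss_out)
    return loss_out
```

This also creates a single ndarray per call instead of two ndarrays sharing one data buffer.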

ogrisel (Member) commented Oct 25, 2023

We could report a minimal reproducer to Cython and/or PyPy so that they can work out a way to avoid this problem in the first place. +1 for simplifying our code anyway to drop the useless asarrays.

mattip commented Oct 25, 2023

Great news. A smaller reproducer would be welcome, although I imagine it hits the C-python mutual reference cycle problem I mentioned above. At least I could add it to the PyPy FAQ of "things to avoid".

lesteve (Member, Author) commented Oct 26, 2023

FYI with the fix from #27662 (comment), the full test suite still uses ~9.5GB on PyPy, so this saved about 1.5GB. The Azure VM has 7GB of RAM (according to this), so more effort is needed to find other places where similar issues occur ...

glemaitre (Member)

@lesteve Did you modify all the np.asarray calls in the loss module?
I opened #27670 and changed the tests in case you want to give it a try; it only modifies the loss function (and not the gradients, hessian, proba, etc.).

glemaitre (Member)

I also recall that we use np.asarray to convert memoryviews back to Python objects; it dates from when we replaced C buffers with read-only memoryviews, which previously did not work with older Cython versions.

glemaitre (Member)

I checked how many times the pattern occurs:

> git grep "np.asarray(" | grep ".pyx:"  | wc -l
      37

We might need to fix those if we want to be on the safe side, hoping that what is passed to asarray is already an array.

glemaitre (Member)

I could come up with the following minimal reproducer based on @lesteve's script.

%%cython

import psutil
import gc
from functools import partial
import platform

import cython
import numpy as np

IS_PYPY = platform.python_implementation() == "PyPy"


def cy_func(cython.floating[::1] arr_out):
    cdef:
        int i
        int size_arr = arr_out.shape[0]
    for i in range(size_arr):
        arr_out[i] = 10.0

    return np.asarray(arr_out)


def func():
    for _ in range(100):
        arr_out = np.ones(1_000_000, dtype=np.float64)
        cy_func(arr_out)


def main():
    for i in range(10):
        memory_usage = psutil.Process().memory_info().rss / 1e9
        message = f'{i} psutil: {memory_usage:.3f}GB'
        if IS_PYPY:
            pypy_memory_usage = gc.get_stats().memory_allocated_sum
            message += f' , pypy total allocated: {pypy_memory_usage}'
        print(message)

        func()
        
    if IS_PYPY:
        print(gc.get_stats())


if __name__ == '__main__':
    main()
0 psutil: 0.287GB , pypy total allocated: 193.2MB
1 psutil: 1.087GB , pypy total allocated: 183.0MB
2 psutil: 1.888GB , pypy total allocated: 183.0MB
3 psutil: 2.688GB , pypy total allocated: 183.1MB
4 psutil: 3.489GB , pypy total allocated: 183.0MB
5 psutil: 4.289GB , pypy total allocated: 183.1MB
6 psutil: 5.089GB , pypy total allocated: 183.0MB
7 psutil: 5.890GB , pypy total allocated: 183.0MB
8 psutil: 6.690GB , pypy total allocated: 183.0MB
9 psutil: 7.490GB , pypy total allocated: 183.1MB
Total memory consumed:
    GC used:            49.0MB (peak: 204.6MB)
       in arenas:            0.2kB
       rawmalloced:          46.0MB
       nursery:              3.0MB
    raw assembler used: 921.6kB
    -----------------------------
    Total:              49.9MB

    Total memory allocated:
    GC allocated:            181.0MB (peak: 207.9MB)
       in arenas:            128.7MB
       rawmalloced:          72.9MB
       nursery:              3.0MB
    raw assembler allocated: 2.0MB
    -----------------------------
    Total:                   183.0MB

    Total time spent in GC:  4.735

mattip commented Oct 26, 2023

Thanks for the minimum reproducer. I opened a pypy issue.

ogrisel (Member) commented Oct 26, 2023

So maybe we can wait before attempting to fix the 37 occurrences of this pattern in scikit-learn. Please let us know @mattip if this is too complex to fix on the PyPy/Cython side: we could then change scikit-learn to avoid this pattern in the first place.

lorentzenchr (Member)

np.asarray(memoryview) is THE canonical way of converting memoryviews to numpy arrays, see https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#coercion-to-numpy.
Removing this pattern should be a last and temporary resort as the bug is not caused by scikit-learn.

ogrisel (Member) commented Oct 30, 2023

I agree when the allocation is done by the Cython code itself. But in this case, the array is allocated (with numpy) and passed by the Python function as input argument to the Cython function. The Python caller could just return the original array it allocated and the Cython function would not return anything. This would be slightly more efficient (by creating a single numpy array instead of 2 numpy arrays that share the same data buffer).

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Nov 3, 2023
lesteve (Member, Author) commented Nov 6, 2023

PyPy timed out 3 times out of the 6 last runs, this is going in the right direction but not quite there yet ...

When it does not time out there is a weird matplotlib issue, see #27648 (edit seemed like it was specific to matplotlib 3.8.0 for some reason)

mattip commented Nov 22, 2023

Saving the reproducer as test_pypy.pyx and running cythonize test_pypy.pyx results in a C extension module. I can use it to recreate the problem:

$ pypy3.9 -m venv /tmp/venv3
$ source /tmp/venv3/bin/activate
$ python -m pip install numpy cython
$ cythonize test_pypy.pyx
$ python
>>>> import test_pypy, numpy as np
>>>> arr_out = np.ones(1_000_000, dtype = np.float64)
>>>> a = test_pypy.cy_func(arr_out)
>>>> type(a)
<class 'numpy.ndarray'>
>>>> type(a.base)
<class 'memoryview'>
>>>> type(a.base.obj)
<class 'test_pypy._memoryviewslice'>
>>>> a.base.obj.base is arr_out
True

So Cython has a class _memoryviewslice, which has a __dealloc__ method and hangs on to the original object, which is wrapped in a memoryview, which is wrapped in an ndarray. This is going to create GC cycles between objects in a way that I am not sure even the cpyext-gc-cycle branch will be able to untangle.
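The same kind of reference chain can be inspected on CPython with a plain memoryview standing in for Cython's _memoryviewslice (an assumption: this mirrors, but does not exactly reproduce, the Cython case). On CPython, refcounting reclaims the chain as soon as the wrapping array is dropped, which is why only PyPy accumulates memory:

```python
import gc
import weakref

import numpy as np

# np.asarray(view) yields an ndarray whose .base is the memoryview,
# which in turn keeps the exporting array alive via its .obj attribute.
arr_out = np.ones(1_000_000, dtype=np.float64)
view = memoryview(arr_out)
a = np.asarray(view)

assert isinstance(a.base, memoryview)
assert a.base.obj is arr_out

ref = weakref.ref(arr_out)
del arr_out
# Still reachable through a -> memoryview -> original array:
assert ref() is not None

del a, view
gc.collect()
# On CPython the whole chain is collected promptly once dropped:
assert ref() is None
```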

lorentzenchr (Member)

Can we close as #27670 is merged?

lesteve (Member, Author) commented Dec 1, 2023

I would keep this one open since I may have another look in the future. Having said that, this is not very high priority, since the PyPy tests time out a lot more rarely after #27670 was merged.

A quick look at #27750 seems to indicate that the PyPy tests timed out 5 times since November 8 (i.e. in 23 days or so), so the situation is a lot better than before, when they always timed out.

mattip commented Dec 1, 2023

The slowest test by far is tests/test_multioutput.py::test_classifier_chain_tuple_order[list], at 267 seconds where the next slowest is 162 seconds. Thanks for looking at this.

lesteve (Member, Author) commented Dec 1, 2023

Thanks for the input! I said "timed out", which is not very accurate: I am reasonably confident the tests get killed because they use too much memory.

ogrisel (Member) commented Dec 4, 2023

I am reasonably confident the tests get killed because they use too much memory.

Which in turn causes the CI host to swap memory and slows the test run to the point of reaching the job-level timeout.

lorentzenchr (Member)

We could very easily reduce the n_estimators of the RF in test_classifier_chain_tuple_order from the default 100 to 10.

ogrisel (Member) commented Mar 27, 2024

As discussed at the last dev meeting, since none of us plans to invest more effort in fixing the root cause of the memory usage problem regularly observed on the PyPy nightly CI run, I posted a call for help on our LinkedIn account and on my personal Mastodon account; feel free to re-post.

lesteve (Member, Author) commented Mar 27, 2024

Thanks! I think if someone is interested in scikit-learn on PyPy and wants to look at this, a good starting point could be to:
