
Memory error for large arrays #10

Open · tglauch opened this issue Jan 12, 2019 · 5 comments

tglauch commented Jan 12, 2019

Hi,

First: thank you for the great package, it works really well! However, I'm experiencing something that I don't quite understand. I'm fitting a pretty large 4D KDE, and everything works fine if I evaluate it on a grid of size

nbins_x, nbins_y, nbins_z, nbins_z2 = 200, 30, 80, 500

but once I increase the resolution a bit, to, say,

nbins_x, nbins_y, nbins_z, nbins_z2 = 200, 40, 80, 500

I get a memory error. I have also profiled the memory consumption in both cases:

1.)

5565.734 MiB 3628.438 MiB   bins1 = np.array(list(itertools.product(bins_Et, bins_sind, bins_Er, bins_dang)))
14721.020 MiB 9155.285 MiB    y = KDE3d.evaluate(bins1)

2.)

6776.355 MiB 4849.047 MiB   bins1 = np.array(list(itertools.product(bins_Et, bins_sind, bins_Er, bins_dang))) 
53828.105 MiB 47051.750 MiB       y = KDE3d.evaluate(bins1)

As you can see, the memory consumption of the evaluate function suddenly jumps from 9 GB to 47 GB for no obvious reason. Unfortunately, I currently don't have time to investigate this in more detail, but maybe you have an idea? If it helps: I see the same behavior on different computers.

Here is the traceback:

  File "KDEs_new.py", line 197, in main
    y = KDE3d.evaluate(bins1)
  File "/home/ga53lag/Software/pyv3/lib/python3.7/site-packages/KDEpy/FFTKDE.py", line 200, in evaluate
    ans = convolve(data, kernel_weights, mode='same').reshape(-1, 1)
  File "/home/ga53lag/Software/pyv3/lib/python3.7/site-packages/scipy/signal/signaltools.py", line 802, in convolve
    out = fftconvolve(volume, kernel, mode=mode)
  File "/home/ga53lag/Software/pyv3/lib/python3.7/site-packages/scipy/signal/signaltools.py", line 415, in fftconvolve
    ret = np.fft.irfftn(sp1 * sp2, fshape, axes=axes)[fslice].copy()
  File "/home/ga53lag/Software/pyv3/lib/python3.7/site-packages/numpy/fft/fftpack.py", line 1232, in irfftn
    a = irfft(a, s[-1], axes[-1], norm)
  File "/home/ga53lag/Software/pyv3/lib/python3.7/site-packages/numpy/fft/fftpack.py", line 466, in irfft
    _real_fft_cache)
  File "/home/ga53lag/Software/pyv3/lib/python3.7/site-packages/numpy/fft/fftpack.py", line 83, in _raw_fft
    r = work_function(a, wsave)
MemoryError

Thanks a lot,
Theo

tglauch changed the title from "Memory Error for large Arrays" to "Memory error for large arrays" Jan 12, 2019
tommyod (Owner) commented Jan 12, 2019

Thanks for letting me know about this, @tglauch. I will look at it when time permits. I have to admit I am not sure why it happens. Are you able to post code reproducing the problem, e.g. using random data and a seed?

tommyod self-assigned this Jan 12, 2019
tommyod (Owner) commented Jan 12, 2019

Based on the traceback, this seems to be related to the convolution in scipy.signal.convolve.

tglauch (Author) commented Jan 12, 2019

Hi,
yes, I agree. I already checked for a known memory leak in scipy.signal.convolve, but couldn't find one.

I'll try to provide example code as soon as possible.

Cheers.

tglauch (Author) commented Jan 13, 2019

Here we go with a snippet:

import itertools

import numpy as np
from KDEpy import FFTKDE


@profile  # from the memory_profiler package, see the note below
def run_code():
    nbins_x, nbins_y, nbins_z, nbins_z2 = 120, 30, 100, 200
    bins_x = np.linspace(0, 1.0, nbins_x, dtype=np.float32)
    bins_y = np.linspace(0, 1.0, nbins_y, dtype=np.float32)
    bins_z = np.linspace(0, 1.0, nbins_z, dtype=np.float32)
    bins_z2 = np.linspace(0, 1.0, nbins_z2, dtype=np.float32)

    np.random.seed(1)
    x = np.random.random(int(3e6))
    w = np.random.random(int(3e6))
    # Generate KDE1
    inp_data = np.column_stack((x, x, x, x))

    print('start KDE1')
    KDE = FFTKDE(kernel='gaussian', bw=0.1).fit(inp_data, w)
    print('evaluate bins')
    bins1 = np.array(list(itertools.product(bins_x, bins_y, bins_z, bins_z2)))
    y = KDE.evaluate(bins1)


if __name__ == '__main__':
    run_code()
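For reference, the `@profile` decorator comes from the memory_profiler package; running the script with `python -m memory_profiler script.py` produces line-by-line tables like the ones below.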

Changing the binning slightly takes you from 2 GB

Line #    Mem usage    Increment   Line Contents
================================================
     5   42.320 MiB   42.320 MiB   @profile
     6                             def run_code():
     7   42.320 MiB    0.000 MiB       nbins_x, nbins_y, nbins_z, nbins_z2 = 110, 30, 100, 200
     8   42.320 MiB    0.000 MiB       bins_x= np.linspace(0,1.0,nbins_x, dtype = np.float32)
     9   42.320 MiB    0.000 MiB       bins_y = np.linspace(0,1.0, nbins_y, dtype = np.float32)
    10   42.320 MiB    0.000 MiB       bins_z = np.linspace(0,1.0,nbins_z, dtype = np.float32)
    11   42.320 MiB    0.000 MiB       bins_z2 = np.linspace(0,1.0, nbins_z2, dtype = np.float32)
    12                             
    13   42.320 MiB    0.000 MiB       np.random.seed(1)
    14   65.293 MiB   22.973 MiB       x = np.random.random(int(3e6))
    15   88.090 MiB   22.797 MiB       w = np.random.random(int(3e6))
    16                                 #Generate KDE1
    17  179.719 MiB   91.629 MiB       inp_data = np.column_stack((x, x, x, x))
    18                             
    19  179.719 MiB    0.000 MiB       print('start KDE1')
    20  205.508 MiB   25.789 MiB       KDE = FFTKDE(kernel='gaussian', bw=0.1).fit(inp_data, w)
    21  205.508 MiB    0.000 MiB       print('evaluate bins')
    22 1226.445 MiB 1020.938 MiB       bins1 = np.array(list(itertools.product(bins_x, bins_y, bins_z, bins_z2)))
    23 3515.254 MiB 2288.809 MiB       y = KDE.evaluate(bins1)

to 52 GB

Line #    Mem usage    Increment   Line Contents
================================================
     5   42.320 MiB   42.320 MiB   @profile
     6                             def run_code():
     7   42.320 MiB    0.000 MiB       nbins_x, nbins_y, nbins_z, nbins_z2 = 120, 30, 100, 200
     8   42.320 MiB    0.000 MiB       bins_x= np.linspace(0,1.0,nbins_x, dtype = np.float32)
     9   42.320 MiB    0.000 MiB       bins_y = np.linspace(0,1.0, nbins_y, dtype = np.float32)
    10   42.320 MiB    0.000 MiB       bins_z = np.linspace(0,1.0,nbins_z, dtype = np.float32)
    11   42.320 MiB    0.000 MiB       bins_z2 = np.linspace(0,1.0, nbins_z2, dtype = np.float32)
    12                             
    13   42.320 MiB    0.000 MiB       np.random.seed(1)
    14   65.355 MiB   23.035 MiB       x = np.random.random(int(3e6))
    15   88.086 MiB   22.730 MiB       w = np.random.random(int(3e6))
    16                                 #Generate KDE1
    17  179.781 MiB   91.695 MiB       inp_data = np.column_stack((x, x, x, x))
    18                             
    19  179.781 MiB    0.000 MiB       print('start KDE1')
    20  205.637 MiB   25.855 MiB       KDE = FFTKDE(kernel='gaussian', bw=0.1).fit(inp_data, w)
    21  205.637 MiB    0.000 MiB       print('evaluate bins')
    22 1318.488 MiB 1112.852 MiB       bins1 = np.array(list(itertools.product(bins_x, bins_y, bins_z, bins_z2)))
    23 53852.961 MiB 52534.473 MiB       y = KDE.evaluate(bins1)

Let me add: the numpy version is 1.15.4.

tommyod (Owner) commented Jan 14, 2019

Hi again, @tglauch. I'm afraid I won't be of much help on this one. Running the code snippet you provided grinds my desktop computer to a halt, so it's a difficult problem to debug. Again, based on the traceback, I believe the problem is related to scipy.signal.convolve.

Some ideas you could try for debugging:

  • Check if the memory jump still happens when the bin counts in the other dimensions are decreased.
  • Check if the memory jump still happens when the shape is permuted, e.g. (a, b, c, d) -> (b, c, d, a).
  • Check if the memory jump can be reproduced when convolving a random 4D array of the same shape as in the provided example, i.e. a 4D array of shape (120, 30, 100, 200), with a smaller 4D kernel. This is essentially what FFTKDE does in the line ans = convolve(data, kernel_weights, mode='same').reshape(-1, 1). A minimal sketch of this idea follows the list.
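A minimal sketch of the last idea. The kernel extents below are placeholders of mine; the real `kernel_weights` shape depends on the bandwidth and the grid spacing. Be warned that this call may allocate a very large amount of memory, which is exactly the behavior to watch for.

import numpy as np
from scipy.signal import convolve

# Random 4D "grid data" with the shape from the example above, convolved
# with a smaller random 4D kernel (placeholder extents).
np.random.seed(1)
data = np.random.random((120, 30, 100, 200))
kernel = np.random.random((15, 5, 13, 25))

# Watch the memory usage of this call, e.g. with memory_profiler.
ans = convolve(data, kernel, mode='same')
print(ans.shape)  # (120, 30, 100, 200)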

An idea for working around this:

import numpy as np

# Utility functions for automatic grids and binning
from KDEpy.utils import autogrid
from KDEpy.binning import linear_binning


def run_code():

    # Bins in each dimension: 50 * 30 * 80 * 100 = 12 million grid points
    nbins_x, nbins_y, nbins_z, nbins_z2 = 50, 30, 80, 100

    # Generate random data and weights
    np.random.seed(1)
    x = np.random.random(int(3e6))
    w = np.random.random(int(3e6))
    inp_data = np.column_stack((x, x, x, x))

    # Create a grid. The maximum of the relative and absolute boundary
    # limits is used, so here the grid goes from -0.5 to 1.5, not -0.05 to 1.05.
    grid = autogrid(inp_data, boundary_abs=0.5,
                    num_points=(nbins_x, nbins_y, nbins_z, nbins_z2),
                    boundary_rel=0.05)

    print(grid[0, :])  # Check the boundaries
    print(grid.shape)

    # Bin the data onto the grid points; the weights are used
    data_vals = linear_binning(inp_data, grid, weights=w)

    # At this point, `data_vals` is a linear approximation to your data at the
    # points in `grid`. You can try to convolve this 4D weighted compression
    # of your data with a kernel, or feed it to a different algorithm, as long
    # as that algorithm handles weighted data.


if __name__ == '__main__':
    run_code()
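A hypothetical continuation of the sketch above: convolve the binned values with a small separable Gaussian kernel. The kernel construction, its 7-point extent, and the implied bandwidth are illustrative assumptions of mine, not what FFTKDE computes internally.

import numpy as np
from scipy.signal import convolve
from scipy.stats import norm

# Stand-in for `data_vals` and the grid shape from the sketch above
# (random values here so this snippet runs on its own).
grid_shape = (50, 30, 80, 100)
data_vals = np.random.random(int(np.prod(grid_shape)))
data_vals_4d = data_vals.reshape(grid_shape)

# Build a separable 4D Gaussian kernel from a 1D profile and normalize it.
k1d = norm.pdf(np.linspace(-3, 3, 7))
kernel = (k1d[:, None, None, None] * k1d[None, :, None, None]
          * k1d[None, None, :, None] * k1d[None, None, None, :])
kernel /= kernel.sum()

# Watch the memory usage of this call, as in the debugging ideas above.
density = convolve(data_vals_4d, kernel, mode='same')
print(density.shape)  # (50, 30, 80, 100)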
