
ENH: io/matlab: avoid making copies of zlib compressed data when loading files #519

Merged
merged 8 commits into from May 4, 2013

2 participants

@pv
SciPy member
pv commented Apr 20, 2013

The Matlab IO code makes some copies of the data when loading files with compressed data. This PR changes the code so that it avoids making any copies --- loading 1 GB of compressed data now requires roughly 1 GB of memory.

Closes Trac #1894 (note that the test file supplied by the user there is malformed --- do something like python2.7 -c 'import numpy as np; from scipy.io import savemat; x=np.zeros([int(500e6/8)]); savemat("foo.mat", dict(x=x), do_compression=1)' to generate a test file.)
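For readability, here is the same one-liner expanded into a short script (the filename and array size are simply the values from the command above; any similarly sized compressed file would do):

```python
# Generates a ~500 MB array of zeros and saves it compressed; loading the
# resulting foo.mat exercises the reduced-memory code path in this PR.
import numpy as np
from scipy.io import savemat

x = np.zeros([int(500e6 / 8)])  # ~500 MB of float64 zeros
savemat("foo.mat", dict(x=x), do_compression=True)
```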

@pv
SciPy member
pv commented Apr 20, 2013
@matthew-brett

Thanks for doing this.

I seem to remember the reason I read all the data in one go was to be able to use the Cython cStringStream to provide data without calling back into Python. I may have believed this was quicker, and I may have tested it with the results of matlab/benchmarks/bench_structarr. I also have a horrible file someone sent me, which was very slow to load and which I can share, but I don't think we can make it public.

@pv
SciPy member
pv commented Apr 20, 2013

I didn't see any observable performance difference in my tests with these changes; the block size should be large enough to ensure that the time is spent in I/O or inside zlib.
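To sketch the idea in pure Python (the block size, function name, and variable names here are illustrative only, not the actual scipy.io.matlab code):

```python
# Illustrative only: decompress a zlib stream in fixed-size blocks so the
# whole compressed buffer never has to sit in memory at once.
import zlib

BLOCK_SIZE = 128 * 1024  # compressed bytes read per iteration

def decompress_in_blocks(fileobj, n_compressed_bytes):
    d = zlib.decompressobj()
    pieces = []
    remaining = n_compressed_bytes
    while remaining > 0:
        chunk = fileobj.read(min(BLOCK_SIZE, remaining))
        if not chunk:
            break
        remaining -= len(chunk)
        pieces.append(d.decompress(chunk))
    pieces.append(d.flush())
    return b"".join(pieces)
```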

@matthew-brett

My worry was not the IO, but that there can be a very large number of calls to get data from the stream. If the stream is a wrapped string, this can be done at the C level, but if it's a custom file object, the calls have to come up to Python level. I've sent you an invitation to a shared Dropbox folder with a file that was very slow to load.

@pv
SciPy member
pv commented Apr 20, 2013

Ok, there was a 4x speed drop for a compressed version of that file, which could be pushed down to 2x in pure Python.

However, here's another go at it, in Cython, since the streams framework was already in place. There is no speed difference any more, and memory usage is roughly 2x lower, as advertised:

# before
$ runlim python2.7 -c 'from scipy.io import loadmat; loadmat("nestedStructedSyptomCompressed.mat")'
[runlim] time:          10.35 seconds
[runlim] space:         792.8 MB
$ runlim python2.7 -c 'from scipy.io import loadmat; loadmat("state-Chromosome-damagedBases.mat")'
[runlim] time:          3.37 seconds
[runlim] space:         2062.6 MB

# after
$ runlim python2.7 -c 'from scipy.io import loadmat; loadmat("nestedStructedSyptomCompressed.mat")'
[runlim] time:          10.64 seconds
[runlim] space:         703.6 MB
$ runlim python2.7 -c 'from scipy.io import loadmat; loadmat("state-Chromosome-damagedBases.mat")'
[runlim] time:          3.37 seconds
[runlim] space:         1084.4 MB
@matthew-brett

Great job - thank you very much for doing that.

I'm sorry to ask - but did the benchmarks catch the slowdown? I was trying, imperfectly, to capture the slowness of the test file there.

@matthew-brett

Might be a good idea to move these comments into the ZlibInputStream object?

@matthew-brett

parameter is max_length

@matthew-brett

Note that negative numbers mean no maximum length: read until the end of the buffer.
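For illustration, the convention being described might be documented like this (a hypothetical sketch only, not the actual scipy.io.matlab.streams code):

```python
class ZlibInputStreamSketch:
    """Hypothetical illustration of the read convention discussed above."""

    def read(self, max_length=-1):
        """Read up to max_length bytes from the decompressed stream.

        A negative max_length means no maximum length: read until the
        end of the buffer.
        """
        raise NotImplementedError("illustration only")
```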

@pv
SciPy member
pv commented Apr 21, 2013

The benchmarks did not seem to indicate a slowdown, even with the Python code.

@pv
SciPy member
pv commented Apr 21, 2013

Oops, the reason for that was that compression wasn't turned on. Now that I check it, the benchmark does exhibit the same 4x slowdown with the old Python code in 4fac312. With 107c7f5 there is no slowdown.

@pv
SciPy member
pv commented Apr 21, 2013

I also moved the code comments around and fixed the benchmark.

@matthew-brett

That's great - thanks. Can you think of a way of checking the memory use with a benchmark or something? An annoying thing to check, I know. It's certainly not obvious to me how to do it.

@pv
SciPy member
pv commented Apr 21, 2013

Monitor status from /proc every few ms?
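Something along these lines, as a rough Linux-only sketch (the field name, polling interval, and function name are illustrative, not part of the PR):

```python
# Poll /proc/<pid>/status and keep track of the largest VmRSS seen,
# approximating peak memory use of the process being benchmarked.
import time

def peak_rss_kb(pid, duration=5.0, interval=0.005):
    peak = 0
    end = time.time() + duration
    while time.time() < end:
        try:
            with open('/proc/%d/status' % pid) as f:
                for line in f:
                    if line.startswith('VmRSS:'):
                        peak = max(peak, int(line.split()[1]))
        except IOError:
            break  # process exited
        time.sleep(interval)
    return peak
```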

@pv
SciPy member
pv commented Apr 21, 2013

Plus mem benchmark.

Some things to consider for later (out of scope for this PR): savemat(..., do_compression=True) consumes ~5x the memory of the data to be saved (probably the worst case --- incompressible data). Without compression, it is about 2x.

@matthew-brett

This presumably fails on Windows and OSX? Maybe try: ... except IOError: raise SkipTest, or similar?
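For example, a guard along these lines in the benchmark (a hedged sketch; it assumes nose's SkipTest as used with numpy.testing at the time, and the helper name is made up):

```python
# Skip the memory benchmark on platforms without /proc (Windows, OSX).
from nose import SkipTest

def _read_proc_status():
    try:
        with open('/proc/self/status') as f:
            return f.read()
    except IOError:
        raise SkipTest("memory benchmark requires /proc")
```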

@matthew-brett

Wow. Many thanks for this. I owe you several expensive Belgian beers.

@matthew-brett

I would have thought it a good idea to skip the memory benchmark on platforms where it can't run, but otherwise it would be excellent to merge this one.

@matthew-brett

WRT memory inefficiency of savemat - see gh-1896

@pv
SciPy member
pv commented Apr 26, 2013

Test skipping is sorted out. I didn't even know that Numpy's testing framework could run benchmarks.

@pv
SciPy member
pv commented Apr 30, 2013

Unless there are further comments, I'll merge this in 18 hours.

@pv pv merged commit 9af4bb9 into scipy:master May 4, 2013