BUG: io: Fix savemat compression OverflowError #5327

Closed
wants to merge 3 commits

Conversation

eternia478

  1. Pre-compute byte counts so that data can be written sequentially.
  2. Fix the savemat OverflowError raised when compressing large matrices.

`VarWriter5.writetop` should precompute the byte count of the array
using its current framework before saving it, allowing the write to be
streamed instead of having the byte count written out of order.

The current commit is for validating this functionality. The next one
will actually revise how savemat works.

This removes the need for calls to update_matrix_tag, as the matrix tag
only needs to be written once per item in the array instead of twice.
The main cause of "OverflowError: size does not fit in an int" is an
implementation limitation in Python 2.7's zlib module. A simple way around
it is to use the compression object that the zlib module also provides and
compress the data one part at a time. Even if a single write would exceed
2 GB, the string can be split into multiple parts for the write.
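
A minimal sketch of the chunked-compression idea (standalone, not the PR's actual VarWriter5 code; the function name and the 1 GiB chunk size are illustrative):

import zlib

CHUNK = 2**30  # compress and write at most 1 GiB per call

def write_compressed(stream, payload):
    # Feed the payload to a zlib compression object piece by piece, so
    # neither zlib nor the stream ever sees a buffer larger than CHUNK.
    zobj = zlib.compressobj()
    for start in range(0, len(payload), CHUNK):
        chunk = zobj.compress(payload[start:start + CHUNK])
        if chunk:
            stream.write(chunk)
    stream.write(zobj.flush())
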
@larsmans
Contributor

larsmans commented Oct 7, 2015

This replaces #5325, right? You can remove the last commit of a PR by first removing it locally:

git reset --hard HEAD^    # (or use git rebase -i)

Then force-push to the PR's branch:

git push -f origin io_savemat_fixes_v5

@larsmans
Contributor

larsmans commented Oct 7, 2015

This needs a test; see scipy/io/matlab/tests/.

@larsmans larsmans added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.io labels Oct 7, 2015
@eternia478
Author

Didn't want to do a messy push -f if I could avoid it. Oh well.

Anyhow, sure, I can add a test. But as I said on the mailing list last night, it will use a massive amount of memory and take a lot of time. Below is a sample script to test with (generating the 2.3 GB matrix takes about 30 seconds and around 4 GB of memory), but I recommend a machine with something like 8 or 16 GB of memory.

Before fixing:

  • Python 2.7: 30 seconds, OverflowError: size does not fit in an int
  • Python 3.4: 3:30 minutes, successful, but uses up 7 GB of memory (~3 times the matrix size)

After fixing:

  • Python 2.7: 3:30 minutes, successful, little more than 3 GB of memory (~1.6 times the matrix size)
  • Python 3.4: same as Python 2.7
import numpy as np
import scipy.io as spi
import scipy.sparse as sps

def generate_large_matrix():
    # Build a ~2.3 GB sparse matrix by summing ten random COO blocks.
    shape = (100000, 100000)
    density = 0.002
    data_mat = sps.csc_matrix(shape, dtype=np.float64)
    def generate_partial_large_matrix():
        randarr = np.random.randint
        nnz = int(shape[0] * shape[1] * density)
        dat = randarr(100, size=nnz)
        row = randarr(shape[0], size=nnz)
        col = randarr(shape[1], size=nnz)
        ent = (dat, (row, col))
        tmp_mat = sps.coo_matrix(ent, shape=shape, dtype=np.float64).tocsc()
        return tmp_mat
    for i in range(10):
        data_mat = data_mat + generate_partial_large_matrix()
    return data_mat

def getsize(mat):
    # Total size of the CSC arrays in (decimal) GB.
    return (mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes) / (1000**3.0)

data_mat = generate_large_matrix()
print("Size of sparse matrix in GB: "+str(getsize(data_mat)))
import time
time.sleep(5)
spi.savemat("large_sparse.mat", {"data_mat": data_mat}, do_compression=True)

@eternia478
Author

Given the amount of resources it will use, will it even be appropriate to add it to the test suite?

@larsmans
Contributor

You could write a test that raises a SkipTest unless SCIPY_XSLOW=1. See HACKING.rst.txt for an explanation and scipy/sparse/tests/test_sparsetools.py for an example.
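
A minimal sketch of what such a test could look like (the test name and matrix sizes here are hypothetical), guarded by the SCIPY_XSLOW environment variable in the spirit of scipy/sparse/tests/test_sparsetools.py:

import os
import unittest

import numpy as np
import scipy.io as spi
import scipy.sparse as sps

def test_savemat_large_compressed():
    # Skip unless the extra-slow tests are explicitly requested.
    if not os.environ.get('SCIPY_XSLOW'):
        raise unittest.SkipTest("set SCIPY_XSLOW=1 to run this slow, memory-hungry test")
    # Build a sparse matrix whose raw data exceeds 2 GB, then save it compressed.
    n, nnz = 100000, 2 * 10**8
    rng = np.random.RandomState(0)
    dat = rng.randint(1, 100, size=nnz).astype(np.float64)
    row = rng.randint(n, size=nnz)
    col = rng.randint(n, size=nnz)
    mat = sps.coo_matrix((dat, (row, col)), shape=(n, n)).tocsc()
    spi.savemat("large_sparse.mat", {"data_mat": mat}, do_compression=True)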

@ev-br ev-br added the needs-work Items that are pending response from the author label Nov 9, 2015
rounds = int(np.ceil(1.0 * len(filestr) / (2**30)))
for idx in xrange(rounds):
    substr = filestr[idx * (2**30):(idx + 1) * (2**30)]
    self.file_stream.write(self.zobj.compress(substr))
Contributor


This could be simplified to:

for idx in range(0, len(filestr), 2**30):
    self.file_stream.write(self.zobj.compress(filestr[idx:idx + 2**30]))

@juliantaylor
Contributor

It doesn't seem worthwhile to me to have such a slow, memory-consuming test for this one quite trivial branch.

@lucascolley
Member

I'm going to close this since it hasn't had any interest since the mailing list post over 8 years ago.

If you think that the issue still persists in SciPy today @eternia478 , and would like to return to this, feel free to reopen! Although it would probably be easier to open a new PR...

@lucascolley lucascolley closed this Jan 7, 2024