BUG: io: Fix savemat compression OverflowError #5327

Closed
wants to merge 3 commits

Conversation

eternia478

  1. Pre-compute byte counts so that data can be written sequentially.
  2. Fix the savemat OverflowError raised when compressing large matrices.

`VarWriter5.writetop` should precompute the byte count of the array
using its current framework before saving it, allowing the write to be
streamed instead of having the byte count written out of order.

The current commit is for validating this functionality. The next one
will actually revise how savemat works.

This removes the need for calls to update_matrix_tag, as the matrix tag
only needs to be written once per item in the array instead of twice.
The main cause of "OverflowError: size does not fit in an int" is an
implementation limitation in Python 2.7's zlib module. A simple way around
it is to use the compression object that the zlib module also provides and
compress the data one part at a time. Even if a single write would exceed
2 GB, the string can be split into multiple parts for the write.
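
A minimal sketch of the chunked-compression idea (standalone, not the PR's actual VarWriter5 code; the function name and the 1 GiB chunk size are illustrative):

import zlib

CHUNK = 2**30  # compress and write at most 1 GiB per call

def write_compressed(stream, payload):
    # Feed the payload to a zlib compression object piece by piece, so
    # neither zlib nor the stream ever sees a buffer larger than CHUNK.
    zobj = zlib.compressobj()
    for start in range(0, len(payload), CHUNK):
        chunk = zobj.compress(payload[start:start + CHUNK])
        if chunk:
            stream.write(chunk)
    stream.write(zobj.flush())
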
@larsmans
Contributor

larsmans commented Oct 7, 2015

This replaces #5325, right? You can remove the last commit of a PR by first removing it locally:

git reset --hard HEAD^    # (or use git rebase -i)

Then force-push to the PR's branch:

git push -f origin io_savemat_fixes_v5

@larsmans
Contributor

larsmans commented Oct 7, 2015

This needs a test; see scipy/io/matlab/tests/.

@larsmans larsmans added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.io labels Oct 7, 2015
@eternia478
Author

Didn't want to do a messy push -f if I could avoid it. Oh well.

Anyhow, sure, I can add a test. But as I said on the mailing list last night, it will use a massive amount of memory and take a lot of time. Below is a sample script to test with (generating the 2.3 GB matrix takes about 30 seconds and around 4 GB of memory), but I recommend a machine with something like 8 or 16 GB of memory.

Before fixing:

  • Python 2.7: 30 seconds, OverflowError: size does not fit in an int
  • Python 3.4: 3:30 minutes, successful, but uses up 7 GB of memory (~3 times the matrix size)

After fixing:

  • Python 2.7: 3:30 minutes, successful, little more than 3 GB of memory (~1.6 times the matrix size)
  • Python 3.4: same as Python 2.7
import numpy as np
import scipy.io as spi
import scipy.sparse as sps

def generate_large_matrix():
    # Build a ~2.3 GB sparse matrix by summing ten random COO blocks.
    shape = (100000, 100000)
    density = 0.002
    data_mat = sps.csc_matrix(shape, dtype=np.float64)
    def generate_partial_large_matrix():
        randarr = np.random.randint
        nnz = int(shape[0] * shape[1] * density)
        dat = randarr(100, size=nnz)
        row = randarr(shape[0], size=nnz)
        col = randarr(shape[1], size=nnz)
        ent = (dat, (row, col))
        tmp_mat = sps.coo_matrix(ent, shape=shape, dtype=np.float64).tocsc()
        return tmp_mat
    for i in range(10):
        data_mat = data_mat + generate_partial_large_matrix()
    return data_mat

def getsize(mat):
    # Total size of the CSC arrays in (decimal) GB.
    return (mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes) / (1000**3.0)

data_mat = generate_large_matrix()
print("Size of sparse matrix in GB: "+str(getsize(data_mat)))
import time
time.sleep(5)
spi.savemat("large_sparse.mat", {"data_mat": data_mat}, do_compression=True)

@eternia478
Author

Given the amount of resources it will use, will it even be appropriate to add it to the test suite?

@larsmans
Contributor

You could write a test that raises a SkipTest unless SCIPY_XSLOW=1. See HACKING.rst.txt for an explanation and scipy/sparse/tests/test_sparsetools.py for an example.
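
A minimal sketch of what such a test could look like (the test name and matrix sizes here are hypothetical), guarded by the SCIPY_XSLOW environment variable in the spirit of scipy/sparse/tests/test_sparsetools.py:

import os
import unittest

import numpy as np
import scipy.io as spi
import scipy.sparse as sps

def test_savemat_large_compressed():
    # Skip unless the extra-slow tests are explicitly requested.
    if not os.environ.get('SCIPY_XSLOW'):
        raise unittest.SkipTest("set SCIPY_XSLOW=1 to run this slow, memory-hungry test")
    # Build a sparse matrix whose raw data exceeds 2 GB, then save it compressed.
    n, nnz = 100000, 2 * 10**8
    rng = np.random.RandomState(0)
    dat = rng.randint(1, 100, size=nnz).astype(np.float64)
    row = rng.randint(n, size=nnz)
    col = rng.randint(n, size=nnz)
    mat = sps.coo_matrix((dat, (row, col)), shape=(n, n)).tocsc()
    spi.savemat("large_sparse.mat", {"data_mat": mat}, do_compression=True)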

@ev-br ev-br added the needs-work Items that are pending response from the author label Nov 9, 2015
rounds = int(np.ceil(1.0 * len(filestr) / (2**30)))
for idx in xrange(rounds):
    substr = filestr[idx * (2**30):(idx + 1) * (2**30)]
    self.file_stream.write(self.zobj.compress(substr))
Contributor


This could be simplified to:

for idx in range(0, len(filestr), 2**30):
    self.file_stream.write(self.zobj.compress(filestr[idx:idx + 2**30]))

@juliantaylor
Contributor

It doesn't seem worthwhile to me to have such a slow, memory-consuming test for this one quite trivial branch.

@lucascolley
Member

I'm going to close this since it hasn't had any interest since the mailing list post over 8 years ago.

If you think that the issue still persists in SciPy today @eternia478 , and would like to return to this, feel free to reopen! Although it would probably be easier to open a new PR...

@lucascolley lucascolley closed this Jan 7, 2024