BUG: io: Fix savemat compression OverflowError #5327
Conversation
eternia478
commented
Oct 7, 2015
- Pre-compute byte counts so data can be written sequentially.
- Fix the savemat OverflowError raised when compressing large matrices.
`VarWriter5.writetop` should precompute the byte count of the array using its current framework before saving it, allowing the write to be streamed instead of having the byte count written out of order. The first commit validates this functionality; the second actually revises how `savemat` works. This removes the need for calls to `update_matrix_tag`, as the matrix tag only needs to be written once per item in the array instead of twice.
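A schematic of the idea, assuming the simplified form of a level-5 MAT element tag (a uint32 type code followed by a uint32 byte count); SciPy's actual `VarWriter5` machinery is more involved than this sketch:

```python
import struct

miMATRIX = 14  # MAT-file level-5 type code for a matrix element

def write_element(stream, payload):
    # Knowing len(payload) up front lets us write the 8-byte tag once and
    # then stream the payload, instead of writing a placeholder tag,
    # seeking back, and patching in the byte count afterwards.
    stream.write(struct.pack('<II', miMATRIX, len(payload)))
    stream.write(payload)
```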
The main problem causing `OverflowError: size does not fit in an int` is a limitation in Python 2.7's zlib module, which cannot accept a buffer whose length does not fit in a C int. A simple way to resolve this is to use the compression object that the zlib module also provides to compress the data one part at a time. Even in the case that a single write exceeds 2 GB, the string can be split into multiple parts for the write.
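A minimal sketch of that chunked approach (`write_compressed` and `CHUNK` are illustrative names, not from this PR):

```python
import zlib

CHUNK = 2 ** 30  # 1 GB: keeps every buffer passed to zlib well under the int limit

def write_compressed(stream, data):
    # Feed the data to one compression object a chunk at a time, so no single
    # zlib call ever sees a buffer whose length overflows a C int.
    zobj = zlib.compressobj()
    for start in range(0, len(data), CHUNK):
        stream.write(zobj.compress(data[start:start + CHUNK]))
    stream.write(zobj.flush())  # emit whatever zlib is still buffering
```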
This replaces #5325, right? You can remove the last commit of a PR by first removing it locally:
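```sh
git reset --hard HEAD^   # standard command to drop the most recent local commit
```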
Then force-push to the PR's branch:
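```sh
git push --force origin <branch-name>   # <branch-name> is a placeholder for the PR branch
```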
This needs a test.
Didn't want to do a messy `push -f` if I could avoid it. Oh well. Anyhow, sure, I can add a test. But as I said on the mailing list last night, it will use a massive amount of memory and take a lot of time. Below is a sample script to test with (generating the 2.3 GB matrix takes about 30 seconds and around 4 GB of memory), but I recommend something like 8 or 16 GB of RAM. Before the fix, `savemat` raises the `OverflowError`; after the fix, it completes successfully.
```python
import time

import numpy as np
import scipy.io as spi
import scipy.sparse as sps


def generate_large_matrix():
    shape = (100000, 100000)
    density = 0.002
    data_mat = sps.csc_matrix(shape, dtype=np.float64)

    def generate_partial_large_matrix():
        randarr = np.random.randint
        dat = randarr(100, size=int(shape[0] * shape[1] * density))
        row = randarr(shape[0], size=int(shape[0] * shape[1] * density))
        col = randarr(shape[1], size=int(shape[0] * shape[1] * density))
        ent = (dat, (row, col))
        tmp_mat = sps.coo_matrix(ent, shape=shape, dtype=np.float64).tocsc()
        return tmp_mat

    # Accumulate ten partial matrices to reach the target size.
    for i in range(10):
        data_mat = data_mat + generate_partial_large_matrix()
    return data_mat


def getsize(mat):
    # Total size of the CSC component arrays, in GB.
    return (mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes) / (1000 ** 3.0)


data_mat = generate_large_matrix()
print("Size of sparse matrix in GB: " + str(getsize(data_mat)))
time.sleep(5)
spi.savemat("large_sparse.mat", {"data_mat": data_mat}, do_compression=True)
```
Given the amount of resources it will use, will it even be appropriate to add to the test suite?
You could write a test that raises a …
```python
rounds = int(np.ceil(1.0 * len(filestr) / (2**30)))
for idx in xrange(rounds):
    substr = filestr[idx * (2**30):(idx + 1) * (2**30)]
    self.file_stream.write(self.zobj.compress(substr))
```
This could be simplified to:

```python
for idx in range(0, len(filestr), 2**30):
    self.file_stream.write(self.zobj.compress(filestr[idx:idx + 2**30]))
```
It doesn't seem worthwhile to me to have such a slow, memory-consuming test for this one quite trivial branch.
I'm going to close this since it hasn't had any interest since the mailing list post over 8 years ago. If you think the issue still persists in SciPy today, @eternia478, and would like to return to this, feel free to reopen! Although it would probably be easier to open a new PR...