Add an experimental "homebrew" CASA .image reader #607
Conversation
@astrofrog maybe give this a quick review? This might be the best solution (once dasked) to all of the CASA segfaulty problems.
So I definitely have this mostly-solved, but the steps of not-loading-into-memory and daskifying the reader are still a step beyond me; the data currently gets loaded into memory at the np.concatenate step, and I haven't found a workaround for that. In the latest commit, I added some commented-out code that shows how to get an n-dimensional array of the chunk indices. How you populate an n-dimensional array-like object without reading every byte is the part I haven't solved. It's probably trivial, though, and I'm just overlooking a single, simple function.
This looks great overall, although I assume that we'll also want to develop a dask + memmap version, correct? (EDIT: ah yes you mention this above!). I've also left a couple of comments below for discussion.
spectral_cube/io/casa_image.py
Outdated
assert cut % 1 == 0
cut = int(cut)
rslt = [np.concatenate(rslt[ii::cut], kk) for ii in range(cut)]
jj += 1
A more memory-efficient way to do this, which wouldn't rely on as many temporary arrays, would be to create the final array first and then use indexing to insert each chunk at the correct location.
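As a minimal sketch of this suggestion (all names here — `chunkshape`, `nchunks`, the fake chunk data — are illustrative, not from the PR code): allocate the final array once, then assign each chunk into its known slot instead of building lists of concatenated temporaries.

```python
import numpy as np

# Illustrative sizes, not the real CASA chunking.
chunkshape = (2, 3)   # shape of each on-disk chunk
nchunks = (4, 2)      # number of chunks along each axis
full_shape = tuple(c * n for c, n in zip(chunkshape, nchunks))

# Fake per-chunk data keyed by chunk index, standing in for bytes read
# from disk; each chunk is filled with its flattened chunk index.
chunks = {idx: np.full(chunkshape, np.ravel_multi_index(idx, nchunks),
                       dtype=float)
          for idx in np.ndindex(*nchunks)}

# Allocate the final array once, then insert each chunk in place.
result = np.empty(full_shape)
for idx, chunk in chunks.items():
    slices = tuple(slice(i * c, (i + 1) * c)
                   for i, c in zip(idx, chunkshape))
    result[slices] = chunk
```

This keeps peak memory at roughly one full array plus one chunk, rather than several intermediate concatenations.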
And just saw the comments below :)
Yep. I am irrationally confident that there is a trivial way to do what I'm trying to do in a single line, but I haven't figured out what that magical invocation is yet.
cube = SpectralCube.read(filename)
make_casa_testimage_of_shape(shape, tmp_path / 'casa.image') |
Do you have any control over the chunking? If so, it would be good to try different chunking options.
no, the chunking is on-disk
just to clarify, I mean can you create files on disk with different degrees of chunking?
afaik, no. CASA makes some choice for you, and I do not know of any way to control that. It's at least not obviously accessible in any of the current interfaces.
@keflavich - I've done some more work to improve performance, and have switched the CASA loader to use the dask reader by default. There is still some work to do, in particular in relation to masks. For now I have modified the code so that we can read in the data for the mask, except that (a) the dtype is probably wrong for datasets with masks, and (b) a lot of datasets I'm looking at have a very small mask file which must be effectively empty, so we need to handle this case. I'll continue work on this, but this is the update for now.
I've now managed to properly read in the masks (which are actually stored as sequences of bits, so this requires a different approach than for the data), and I've also fixed the reader for 64-bit floating point data. I have all the tests passing locally. We can also read in different masks by name with the low-level function, and that could potentially be exposed from the SpectralCube reader. At this point there are a few remaining to-dos.
Otherwise I think this is close to ready. Note that for now I haven't seen performance benefits from using dask, in part due to the small chunk size, but what's important here is first to make sure things are correct and efficient memory-wise, and don't segfault!
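To illustrate the bit-packed masks mentioned above: a packed byte buffer can be expanded back to a boolean array with np.unpackbits. This is only a sketch — the byte values and the `bitorder='little'` choice here are assumptions for illustration, and the actual on-disk bit order used by CASA would need to be checked.

```python
import numpy as np

# Hypothetical packed mask bytes; in the real reader these would come
# from the CASA mask file. The bit order is an assumption.
packed = np.array([0b00000101], dtype=np.uint8)

# Expand each byte into 8 boolean mask values, least-significant bit first.
mask = np.unpackbits(packed, bitorder='little').astype(bool)
```

This is why masks need a different code path from the floating-point data: one byte on disk encodes eight mask elements.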
@keflavich - I think this is ready for review. One thing that I'd like to understand is how CASA chooses chunk sizes, and whether we can know for sure they will always be smaller than a certain size. If so, then we could replace the memmap with a fromfile, since otherwise we have two 'layers' of lazy loading — the dask chunks and the memmap within each chunk — which maybe isn't needed. However, if chunks can be arbitrarily large, we should keep the memmap. In any case, this is not something that needs to hold back this PR. As-is, this PR seems to work and the tests pass. I've tried to make the internal dask reader as fast as possible, but it can still take ~5 seconds each for the data and mask dask arrays to be constructed when there are ~30,000 chunks.
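A minimal sketch of the two alternatives described above, using a scratch file standing in for part of a CASA data file (the filename, sizes, and dtype are illustrative): np.memmap gives a lazy view over the whole file that you slice per chunk, while np.fromfile with its `offset` argument reads just one chunk eagerly.

```python
import os
import tempfile
import numpy as np

# Scratch file standing in for an on-disk CASA data file.
path = os.path.join(tempfile.mkdtemp(), 'chunks.bin')
np.arange(20, dtype='<f4').tofile(path)

# memmap approach: lazy view over the whole file, sliced to get one chunk.
mm = np.memmap(path, dtype='<f4', mode='r')
chunk_lazy = np.asarray(mm[10:20])

# fromfile approach: eagerly read just that chunk via a byte offset,
# avoiding a second layer of lazy loading inside each dask chunk.
itemsize = np.dtype('<f4').itemsize
chunk_eager = np.fromfile(path, dtype='<f4', count=10, offset=10 * itemsize)

assert np.array_equal(chunk_lazy, chunk_eager)
```

The trade-off is the one noted above: fromfile is only safe if every chunk is known to be small enough to hold in memory.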
# We open a file manually and return an in-memory copy of the array
# otherwise the file doesn't get closed properly.
with open(self._filename) as f:
    return np.memmap(f, mode='readonly', order='F', **self._kwargs).T[item].copy()
This is kind of a mess, but I see why it's needed. There must be some way to define a np.memmap-like wrapper around mmap.mmap, the lower-level Python function, that internally translates the slice syntax to the appropriate file locations. I'm not sure whether that would be more performant, but it would at least prevent the many-open-files problems.
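A hypothetical sketch of such a wrapper (the class name `SliceReader` and its interface are inventions for illustration, not the PR's implementation): translate a 1-D slice into a byte range, read only that range through mmap.mmap, and close the mapping immediately so file handles never accumulate.

```python
import mmap
import os
import tempfile
import numpy as np

class SliceReader:
    """Hypothetical np.memmap-like wrapper around mmap.mmap that
    translates 1-D slice syntax into a byte range, reads it, and closes
    the file right away so open handles don't accumulate."""

    def __init__(self, filename, dtype):
        self.filename = filename
        self.dtype = np.dtype(dtype)
        self.size = os.path.getsize(filename) // self.dtype.itemsize

    def __getitem__(self, item):
        start, stop, step = item.indices(self.size)
        offset = start * self.dtype.itemsize
        nbytes = max(stop - start, 0) * self.dtype.itemsize
        with open(self.filename, 'rb') as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing an mmap returns a bytes copy, so the result is
            # valid after the mapping is closed.
            buf = mm[offset:offset + nbytes]
        return np.frombuffer(buf, dtype=self.dtype)[::step]

# Demo on a scratch file.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
np.arange(10, dtype='<f8').tofile(path)
reader = SliceReader(path, '<f8')
chunk = reader[2:5]
```

Whether this beats repeatedly re-creating np.memmap objects would need benchmarking, but it avoids keeping any file descriptor alive between accesses.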
Yes, that could be a solution; I'll investigate whether it helps with performance.
Partial solution to #605. I think this code now clearly-ish outlines what is needed to build an array from the bytes on disk; the next step is to daskify this reader process.
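One way the daskifying step could look, as a sketch under assumed names (`read_chunk`, `chunkshape`, and `nchunks` are placeholders, not the PR's code): wrap a per-chunk reader in dask.delayed and assemble the grid of lazy chunks with da.block, so nothing is read until compute time.

```python
import numpy as np
import dask
import dask.array as da

# Illustrative on-disk chunking, not the real CASA layout.
chunkshape = (2, 2)
nchunks = (3, 2)

def read_chunk(idx):
    # Stand-in for reading one chunk's bytes from disk; the real reader
    # would seek to this chunk's offset in the CASA file.
    return np.full(chunkshape, float(np.ravel_multi_index(idx, nchunks)))

# One delayed task per on-disk chunk, assembled into the full grid
# without loading everything into memory.
blocks = [[da.from_delayed(dask.delayed(read_chunk)((i, j)),
                           shape=chunkshape, dtype=float)
           for j in range(nchunks[1])]
          for i in range(nchunks[0])]
full = da.block(blocks)
```

Slicing `full` then only triggers `read_chunk` for the chunks the slice actually touches, which is exactly the not-loading-into-memory behavior described above.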
Also, I want a large series of tests to actually run. This is just what I did to get to the "good enough" stage. We really need to test this on different: