
Add an experimental "homebrew" CASA .image reader #607

Merged
merged 15 commits into radio-astro-tools:master from keflavich:casa_numpy_reader on Feb 19, 2020

Conversation

keflavich
Contributor

Partial solution to #605. I think this code now clearly-ish outlines what is needed to build an array from the bytes on disk; the next step is to daskify this reader process.

Also, I want a much larger set of tests to actually run; this is just what I did to get to the "good enough" stage. We really need to test this with different:

    1. numbers of dimensions (1, 2, 3, 4, 5)
    2. numbers of chunks (<=1, 2-10)
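
A rough sketch of what such a parametrized test could look like (the helper and reader names here are placeholders loosely based on this PR, not its exact API, and the shape assertion is only illustrative):

```python
import pytest

# Shapes covering 1-5 dimensions; depending on how CASA chunks each image
# on disk, these should also exercise different numbers of chunks.
SHAPES = [(129,),
          (129, 128),
          (129, 128, 130),
          (129, 128, 130, 2),
          (129, 128, 130, 2, 2)]

@pytest.mark.parametrize('shape', SHAPES)
def test_casa_reader_roundtrip(shape, tmp_path):
    # Hypothetical helpers: make_casa_testimage_of_shape writes a CASA image
    # of the given shape, and read_casa_image stands in for whatever
    # low-level reader this PR ends up with.
    make_casa_testimage_of_shape(shape, tmp_path / 'casa.image')
    data = read_casa_image(tmp_path / 'casa.image')
    assert data.shape == shape
```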

@keflavich
Contributor Author

@astrofrog maybe give this a quick review? This might be the best solution (once daskified) to all of the CASA segfault problems.

@coveralls

coveralls commented Feb 9, 2020

Coverage Status

Coverage increased (+0.2%) to 88.169% when pulling ebb5ebc on keflavich:casa_numpy_reader into 31b0b51 on radio-astro-tools:master.

@astrofrog astrofrog self-requested a review February 9, 2020 21:33
@keflavich
Contributor Author

So I definitely have this mostly solved, but the steps of not loading into memory and daskifying the reader are still a step beyond me. Creating our own slicing, as we did with ArrayLikeCasaData, will definitely work, but it would require manual concatenation, etc. There has to be a better, simpler way.

In the latest commit, I added some commented-out code that shows how to get an n-dimensional array of the chunk indices. How you populate an n-dimensional array-like object without reading every byte is the part I haven't solved. It's probably trivial, though, and I'm just overlooking a single, simple function.
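
For reference, one generic pattern for this kind of lazy assembly (a sketch only, not the code in this PR; `read_chunk`, `chunkshape`, and `nchunks_per_dim` are placeholder names) is to wrap each per-chunk read in `dask.delayed` and stitch the pieces together with `dask.array.block`:

```python
import dask
import dask.array as da

def lazy_chunked_array(read_chunk, chunkshape, nchunks_per_dim, dtype):
    """Build an n-dimensional dask array from per-chunk readers, without
    touching the bytes on disk until a slice is actually computed.

    read_chunk(chunk_index) should return an ndarray of shape ``chunkshape``
    for the chunk at the given n-dimensional chunk index.
    """
    def build(level, index):
        # Recurse over dimensions, producing the nested list-of-lists layout
        # that da.block expects (outermost list = first axis).
        if level == len(nchunks_per_dim):
            delayed_chunk = dask.delayed(read_chunk)(tuple(index))
            return da.from_delayed(delayed_chunk, shape=chunkshape, dtype=dtype)
        return [build(level + 1, index + [i])
                for i in range(nchunks_per_dim[level])]

    return da.block(build(0, []))
```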

Member

@astrofrog astrofrog left a comment


This looks great overall, although I assume that we'll also want to develop a dask + memmap version, correct? (EDIT: ah yes you mention this above!). I've also left a couple of comments below for discussion.

assert cut % 1 == 0
cut = int(cut)
rslt = [np.concatenate(rslt[ii::cut], kk) for ii in range(cut)]
jj += 1
Member


A more memory-efficient way to do this, which doesn't rely on as many temporary arrays as this does, would be to create the final array first and then use indexing to insert each chunk at the correct location.
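
A sketch of that pattern (placeholder names throughout; the chunk shape and per-axis chunk counts are assumed to be known):

```python
import itertools
import numpy as np

def assemble_from_chunks(read_chunk, chunkshape, nchunks_per_dim, dtype):
    # Preallocate the final array once, then drop each chunk straight into
    # its target slice instead of building intermediate concatenations.
    shape = tuple(n * c for n, c in zip(nchunks_per_dim, chunkshape))
    full = np.empty(shape, dtype=dtype)
    for chunk_index in itertools.product(*(range(n) for n in nchunks_per_dim)):
        target = tuple(slice(i * c, (i + 1) * c)
                       for i, c in zip(chunk_index, chunkshape))
        full[target] = read_chunk(chunk_index)
    return full
```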

Member


And just saw the comments below :)

Contributor Author


Yep. I am irrationally confident that there is a trivial way to do what I'm trying to do in a single line, but I haven't figured out what that magical invocation is yet.


cube = SpectralCube.read(filename)

make_casa_testimage_of_shape(shape, tmp_path / 'casa.image')
Member


Do you have any control over the chunking? If so, it would be good to try different chunking options.

Contributor Author


no, the chunking is on-disk

Member


just to clarify, I mean can you create files on disk with different degrees of chunking?

Contributor Author


afaik, no. CASA makes some choice for you, and I do not know of any way to control that. It's at least not obviously accessible in any of the current interfaces.

@astrofrog
Member

astrofrog commented Feb 14, 2020

@keflavich - I've done some more work to improve performance, and have switched the CASA loader to use the dask reader by default. There is still some work to do, in particular in relation to masks. For now I have modified the code so that we can read in the data for the mask, except that (a) the dtype is probably wrong for datasets with masks, and (b) a lot of datasets I'm looking at have a very small mask file which must be effectively empty, so we need to handle this case. I'll continue work on this, but this is the update for now.

The casa_image_dask_reader is about as fast as I can make it now, so next I'll also need to investigate making sure SpectralCube.read is as fast as possible.

@astrofrog
Member

astrofrog commented Feb 19, 2020

I've now managed to properly read in the masks (which are actually stored as sequences of bits, so this requires a different approach than for the data), and I've also fixed it for 64-bit floating point data. I have all the tests passing locally. We can also read in different masks by name with the low-level function, and that could potentially be exposed from the SpectralCube reader.
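
For reference, a minimal sketch of what unpacking a bit-packed mask buffer can look like (the bit order and padding assumed here are illustrative, not necessarily what CASA uses):

```python
import numpy as np

def unpack_mask(raw_bytes, shape):
    # Expand packed bits into booleans; assumes little-endian bit order
    # within each byte and trailing padding bits at the end of the buffer.
    packed = np.frombuffer(raw_bytes, dtype=np.uint8)
    bits = np.unpackbits(packed, bitorder='little')
    nelem = int(np.prod(shape))
    return bits[:nelem].astype(bool).reshape(shape, order='F')
```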

At this point, the remaining to-dos are:

  • Clean up the CASA tests - I added a new detailed test of the low-level function, but I need to check if we can remove or combine some of the other tests
  • Make sure that the SpectralCube CASA reader (not the low-level function) is memory-efficient and doesn't cause all of the data to be read in

Otherwise I think this is close to ready. Note that for now I haven't seen performance benefits from using dask, in part due to the small chunk size, but what's important here is first to make sure things are correct and efficient memory-wise, and don't segfault!

@astrofrog
Member

@keflavich - I think this is ready for review. One thing that I'd like to understand is how CASA chooses chunk sizes and if we can know for sure they will always be smaller than a certain size. If so, then we could replace the memmap with a fromfile since otherwise we have two 'layers' of lazy-loading - the dask chunks and the memmap within that, which maybe isn't needed. However, if chunks can be arbitrarily large, we should keep the memmap. In any case, this is not something that needs to hold back this PR. As-is, this PR seems to work and the tests pass. I've tried to make the internal dask reader as fast as possible but it can still take ~5 seconds for the data and mask each to construct the dask array when there are ~30,000 chunks.
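
To illustrate the trade-off being weighed here (the filename, dtype, offset, and shape below are placeholders): np.memmap keeps the chunk lazy until it is sliced, while np.fromfile reads the whole chunk eagerly:

```python
import numpy as np

chunk_offset = 0                # placeholder byte offset of one chunk
chunk_shape = (128, 128, 4)     # placeholder on-disk chunk shape

# Lazy: only the pages actually touched by a later slice get read from disk
lazy_chunk = np.memmap('table.f0', dtype='>f4', mode='r',
                       offset=chunk_offset, shape=chunk_shape, order='F')

# Eager: the entire chunk is read into memory immediately
with open('table.f0', 'rb') as f:
    f.seek(chunk_offset)
    eager_chunk = np.fromfile(f, dtype='>f4', count=int(np.prod(chunk_shape)))
    eager_chunk = eager_chunk.reshape(chunk_shape, order='F')
```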

# We open a file manually and return an in-memory copy of the array
# otherwise the file doesn't get closed properly.
with open(self._filename) as f:
    return np.memmap(f, mode='readonly', order='F', **self._kwargs).T[item].copy()
Contributor Author


This is kind of a mess, but I see why it's needed.

There must be some way to define a np.memmap-like wrapper around mmap.mmap, the lower-level python function, that internally translates the slice syntax to the appropriate locations. I'm not sure whether that would be more performant, but it would at least prevent the many-open-files problems.
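
A minimal sketch of what such a wrapper could look like (not the code in this PR; it keeps a single mmap open per file and hands out copies of the requested slices):

```python
import mmap
import numpy as np

class MMapArrayWrapper:
    """Hypothetical np.memmap-like wrapper built on mmap.mmap, so repeated
    __getitem__ calls do not each reopen the underlying file."""

    def __init__(self, filename, shape, dtype, offset=0):
        self._file = open(filename, 'rb')
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        count = int(np.prod(shape))
        # A Fortran-ordered view over the mapped bytes; nothing is read yet.
        self._array = np.frombuffer(self._mmap, dtype=dtype, count=count,
                                    offset=offset).reshape(shape, order='F')

    def __getitem__(self, item):
        # Copy so the result remains valid after close()
        return self._array[item].copy()

    def close(self):
        # Drop the view before closing the mmap so no buffer references remain
        del self._array
        self._mmap.close()
        self._file.close()
```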

Member


Yes, that could be a solution; I'll investigate whether it helps with performance.

@keflavich keflavich merged commit 8bcd395 into radio-astro-tools:master Feb 19, 2020
@e-koch e-koch mentioned this pull request Feb 20, 2020
@keflavich keflavich deleted the casa_numpy_reader branch December 13, 2021 14:02