Add an experimental "homebrew" CASA .image reader #607
Conversation
@astrofrog maybe give this a quick review? This might be the best solution (once dasked) to all of the CASA segfaulty problems.
So I definitely have this mostly-solved, but the steps of not-loading-into-memory and daskifying the reader are still a step beyond me; the data currently gets loaded into memory at the np.concatenate step, and I haven't found a workaround for that. In the latest commit, I added some commented-out code that shows how to get an n-dimensional array of the chunk indices. How you populate an n-dimensional array-like object without reading every byte is the part I haven't solved. It's probably trivial, though, and I'm just overlooking a single, simple function.
This looks great overall, although I assume that we'll also want to develop a dask + memmap version, correct? (EDIT: ah yes you mention this above!). I've also left a couple of comments below for discussion.
spectral_cube/io/casa_image.py
Outdated
assert cut % 1 == 0
cut = int(cut)
rslt = [np.concatenate(rslt[ii::cut], kk) for ii in range(cut)]
jj += 1
A more memory-efficient way to do this, which wouldn't rely on as many temporary arrays, would be to create the final array first and then use indexing to insert each chunk at the correct location.
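As a minimal sketch of this suggestion (all names here — `chunkshape`, `nchunks`, the fake chunk data — are illustrative, not from the PR code): allocate the final array once, then assign each chunk into its known slot instead of building lists of concatenated temporaries.

```python
import numpy as np

# Illustrative sizes, not the real CASA chunking.
chunkshape = (2, 3)   # shape of each on-disk chunk
nchunks = (4, 2)      # number of chunks along each axis
full_shape = tuple(c * n for c, n in zip(chunkshape, nchunks))

# Fake per-chunk data keyed by chunk index, standing in for bytes read
# from disk; each chunk is filled with its flattened chunk index.
chunks = {idx: np.full(chunkshape, np.ravel_multi_index(idx, nchunks),
                       dtype=float)
          for idx in np.ndindex(*nchunks)}

# Allocate the final array once, then insert each chunk in place.
result = np.empty(full_shape)
for idx, chunk in chunks.items():
    slices = tuple(slice(i * c, (i + 1) * c)
                   for i, c in zip(idx, chunkshape))
    result[slices] = chunk
```

This keeps peak memory at roughly one full array plus one chunk, rather than several intermediate concatenations.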
And just saw the comments below :)
Yep. I am irrationally confident that there is a trivial way to do what I'm trying to do in a single line, but I haven't figured out what that magical invocation is yet.
cube = SpectralCube.read(filename)
make_casa_testimage_of_shape(shape, tmp_path / 'casa.image') |
Do you have any control over the chunking? If so, it would be good to try different chunking options.
no, the chunking is on-disk
just to clarify, I mean can you create files on disk with different degrees of chunking?
afaik, no. CASA makes some choice for you, and I do not know of any way to control that. It's at least not obviously accessible in any of the current interfaces.
@keflavich - I've done some more work to improve performance, and have switched the CASA loader to use the dask reader by default. There is still some work to do, in particular in relation to masks. For now I have modified the code so that we can read in the data for the mask, except that (a) the dtype is probably wrong for datasets with masks, and (b) a lot of datasets I'm looking at have a very small mask file which must be effectively empty, so we need to handle this case. I'll continue work on this, but this is the update for now.
I've now managed to properly read in the masks (which are actually stored as sequences of bits, so this requires a different approach than for the data), and I've also fixed the reader for 64-bit floating point data. I have all the tests passing locally. We can also read in different masks by name with the low-level function, and that could potentially be exposed from the SpectralCube reader. At this point there are a few remaining to-dos.
Otherwise I think this is close to ready. Note that for now I haven't seen performance benefits from using dask, in part due to the small chunk size, but what's important here is first to make sure things are correct and efficient memory-wise, and don't segfault!
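To illustrate the bit-packed masks mentioned above: a packed byte buffer can be expanded back to a boolean array with np.unpackbits. This is only a sketch — the byte values and the `bitorder='little'` choice here are assumptions for illustration, and the actual on-disk bit order used by CASA would need to be checked.

```python
import numpy as np

# Hypothetical packed mask bytes; in the real reader these would come
# from the CASA mask file. The bit order is an assumption.
packed = np.array([0b00000101], dtype=np.uint8)

# Expand each byte into 8 boolean mask values, least-significant bit first.
mask = np.unpackbits(packed, bitorder='little').astype(bool)
```

This is why masks need a different code path from the floating-point data: one byte on disk encodes eight mask elements.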
@keflavich - I think this is ready for review. One thing that I'd like to understand is how CASA chooses chunk sizes, and whether we can know for sure they will always be smaller than a certain size. If so, then we could replace the memmap with a fromfile, since otherwise we have two 'layers' of lazy loading — the dask chunks and the memmap within each chunk — which maybe isn't needed. However, if chunks can be arbitrarily large, we should keep the memmap. In any case, this is not something that needs to hold back this PR. As-is, this PR seems to work and the tests pass. I've tried to make the internal dask reader as fast as possible, but it can still take ~5 seconds each for the data and mask dask arrays to be constructed when there are ~30,000 chunks.
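A minimal sketch of the two alternatives described above, using a scratch file standing in for part of a CASA data file (the filename, sizes, and dtype are illustrative): np.memmap gives a lazy view over the whole file that you slice per chunk, while np.fromfile with its `offset` argument reads just one chunk eagerly.

```python
import os
import tempfile
import numpy as np

# Scratch file standing in for an on-disk CASA data file.
path = os.path.join(tempfile.mkdtemp(), 'chunks.bin')
np.arange(20, dtype='<f4').tofile(path)

# memmap approach: lazy view over the whole file, sliced to get one chunk.
mm = np.memmap(path, dtype='<f4', mode='r')
chunk_lazy = np.asarray(mm[10:20])

# fromfile approach: eagerly read just that chunk via a byte offset,
# avoiding a second layer of lazy loading inside each dask chunk.
itemsize = np.dtype('<f4').itemsize
chunk_eager = np.fromfile(path, dtype='<f4', count=10, offset=10 * itemsize)

assert np.array_equal(chunk_lazy, chunk_eager)
```

The trade-off is the one noted above: fromfile is only safe if every chunk is known to be small enough to hold in memory.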
# We open a file manually and return an in-memory copy of the array
# otherwise the file doesn't get closed properly.
with open(self._filename) as f:
    return np.memmap(f, mode='readonly', order='F', **self._kwargs).T[item].copy()
This is kind of a mess, but I see why it's needed. There must be some way to define a np.memmap-like wrapper around mmap.mmap, the lower-level Python function, that internally translates the slice syntax to the appropriate file locations. I'm not sure whether that would be more performant, but it would at least prevent the many-open-files problems.
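A hypothetical sketch of such a wrapper (the class name `SliceReader` and its interface are inventions for illustration, not the PR's implementation): translate a 1-D slice into a byte range, read only that range through mmap.mmap, and close the mapping immediately so file handles never accumulate.

```python
import mmap
import os
import tempfile
import numpy as np

class SliceReader:
    """Hypothetical np.memmap-like wrapper around mmap.mmap that
    translates 1-D slice syntax into a byte range, reads it, and closes
    the file right away so open handles don't accumulate."""

    def __init__(self, filename, dtype):
        self.filename = filename
        self.dtype = np.dtype(dtype)
        self.size = os.path.getsize(filename) // self.dtype.itemsize

    def __getitem__(self, item):
        start, stop, step = item.indices(self.size)
        offset = start * self.dtype.itemsize
        nbytes = max(stop - start, 0) * self.dtype.itemsize
        with open(self.filename, 'rb') as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing an mmap returns a bytes copy, so the result is
            # valid after the mapping is closed.
            buf = mm[offset:offset + nbytes]
        return np.frombuffer(buf, dtype=self.dtype)[::step]

# Demo on a scratch file.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
np.arange(10, dtype='<f8').tofile(path)
reader = SliceReader(path, '<f8')
chunk = reader[2:5]
```

Whether this beats repeatedly re-creating np.memmap objects would need benchmarking, but it avoids keeping any file descriptor alive between accesses.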
Yes, that could be a solution; I'll investigate whether it helps with performance.
Partial solution to #605. I think this code now clearly-ish outlines what is needed to build an array from the bytes on disk; the next step is to daskify this reader process.
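One way the daskifying step could look, as a sketch under assumed names (`read_chunk`, `chunkshape`, and `nchunks` are placeholders, not the PR's code): wrap a per-chunk reader in dask.delayed and assemble the grid of lazy chunks with da.block, so nothing is read until compute time.

```python
import numpy as np
import dask
import dask.array as da

# Illustrative on-disk chunking, not the real CASA layout.
chunkshape = (2, 2)
nchunks = (3, 2)

def read_chunk(idx):
    # Stand-in for reading one chunk's bytes from disk; the real reader
    # would seek to this chunk's offset in the CASA file.
    return np.full(chunkshape, float(np.ravel_multi_index(idx, nchunks)))

# One delayed task per on-disk chunk, assembled into the full grid
# without loading everything into memory.
blocks = [[da.from_delayed(dask.delayed(read_chunk)((i, j)),
                           shape=chunkshape, dtype=float)
           for j in range(nchunks[1])]
          for i in range(nchunks[0])]
full = da.block(blocks)
```

Slicing `full` then only triggers `read_chunk` for the chunks the slice actually touches, which is exactly the not-loading-into-memory behavior described above.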
Also, I want a large series of tests to actually run. This is just what I did to get to the "good enough" stage. We really need to test this on different: