
Experimental clean-up release #285

sjperkins opened this issue Aug 29, 2023 · 6 comments

sjperkins (Member) commented Aug 29, 2023

Description

We've been experimenting with a number of features in dask-ms for a while now. Now that we better understand how these features operate in a distributed environment, it's worth consolidating them into a list for a release. This may involve breaking some experimental functionality and, hopefully, little of the existing non-experimental functionality.

TODO

  • Remove the vendored xarray dataset and make xarray a first class dependency.
    • The original reason for this was that xarray didn't handle nan chunks, which we used in averaging code because it was thought that it wasn't possible to determine the number of output rows (especially in the case of BDA data).
    • This is doable by keeping the index in memory (the Quartical approach) and applying the averaging mapping functions.
    • This will affect https://github.com/ratt-ru/xova and possibly pfb-clean
  • Use xarray's open_zarr and to_zarr methods
  • Consolidate metadata into a single xarray __daskms_metadata__ attribute (see the sketch after this list). Metadata such as the following should be placed within it:
    • Partitioning schema
    • CASA descriptors
    • Original MS Opening Parameters
  • Consolidate MS support into xds_{from,to}_ms #273
    • Originally, xds_to_table handled writes to any CASA table, while xds_from_ms existed to interpret a CASA table (and its subtables) as a Measurement Set, adding appropriate coordinates to the MS columns.
    • It's probably more ergonomic to consolidate this into xds_{from,to}_ms accessors.
    • Also review what's strictly necessary in these interfaces.
  • Swap out python-casacore for https://github.com/ratt-ru/arcae for CASA table access:
    • At minimum this should give us thread-safe distributed reads
    • It's not clear if we can get distributed writes through this approach and whether this is worthwhile.
      • Determine whether this is worth the effort
  • Incorporate fragments: "Add experimental code for data fragments" #282
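
A minimal sketch of what the consolidated metadata attribute could look like, assuming it is stored as a single JSON-encoded attrs entry on each dataset. The attribute name __daskms_metadata__ comes from the item above, but the key names and the JSON encoding are illustrative assumptions rather than a settled format:

```python
# Hypothetical layout for a consolidated __daskms_metadata__ attribute.
# The attribute name comes from the proposal above; the keys and the
# JSON encoding are illustrative assumptions only.
import json

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"DATA": (("row", "chan", "corr"), np.zeros((10, 4, 2), np.complex64))}
)

daskms_metadata = {
    # Partitioning schema: the columns used to split the MS into datasets.
    "partition_schema": [["FIELD_ID", "i4"], ["DATA_DESC_ID", "i4"]],
    # CASA table descriptor needed to reconstruct columns on write.
    "casa_descriptor": {"DATA": {"valueType": "COMPLEX", "ndim": 2}},
    # Parameters originally used to open the MS.
    "open_parameters": {"table": "obs.ms", "index_cols": ["TIME"]},
}

# A single JSON-encoded attribute round-trips cleanly through zarr attrs.
ds.attrs["__daskms_metadata__"] = json.dumps(daskms_metadata)
```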
sjperkins (Member, Author) commented Aug 29, 2023

/cc @JSKenyon @landmanbester @o-smirnov

JSKenyon (Collaborator) commented:

It all looks good to me! Will gladly help out where necessary.

sjperkins (Member, Author) commented Aug 30, 2023

I've been thinking a bit about xarray as a first class dependency and the fact that xarray doesn't support nan chunks. Here's the xarray issue:

In the case of standard averaging this will require reification of TIME, INTERVAL, ANTENNA1 and ANTENNA2 columns (and probably CHAN_FREQ and CHAN_WIDTH too).

```python
def row_mapper(time, interval, antenna1, antenna2,
               flag_row=None, time_bin_secs=1):
```

In the more complex BDA case this requires reification of TIME, INTERVAL, ANTENNA1, ANTENNA2, UVW, CHAN_WIDTH and CHAN_FREQ.

```python
def bda_mapper(time, interval, ant1, ant2, uvw, chan_width, chan_freq,
               max_uvw_dist, flag_row=None, max_fov=3.0,
               decorrelation=0.98, time_bin_secs=None, min_nchan=1):
```

I can broadly see two approaches

Place reified coordinates on Datasets

The obvious way of handling this (which I've avoided up until now due to memory concerns) would be to make the following fully reified coordinates on the resultant xarray datasets (a sketch follows the pros and cons below):

  • MSv2.0 indexing columns
  • SPECTRAL_WINDOW CHAN_FREQ and CHAN_WIDTH.
  • POLARIZATION CORR_PRODUCT
  • Perhaps some relevant FEED columns.

Pros:

  • We can reason about the grid in-memory, and produce accurately chunked dask arrays
  • Following on from the point above, it would also give us the ability to check for regularity in the grid

Cons:

  • The memory overhead of fully reified columns in the driver program, which will increase as dump rates and antenna counts do.
  • For the BDA case, we'd need to reify UVW coordinates, further compounding driver program memory costs. These are 3× the size of TIME, but it would still be preferable not to reify them.
    • Maybe UVW can be made an optional coordinate
    • Or use TIME_CENTROID and ANTENNA::POSITION to generate UVW on the fly.
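
As a rough illustration of what reified coordinates buy us, here is a minimal sketch in plain xarray and numpy (not the dask-ms API; all names and values below are made up for the example):

```python
# Illustrative only: a dataset with reified MSv2.0 indexing columns attached
# as in-memory coordinates. Values are fabricated for the example.
import numpy as np
import xarray as xr

nrow, nchan, ncorr = 6, 4, 2
chan_width = 208984.375  # Hz, an illustrative channel width
chan_freq = 856e6 + chan_width * (np.arange(nchan) + 0.5)

ds = xr.Dataset(
    {"DATA": (("row", "chan", "corr"),
              np.zeros((nrow, nchan, ncorr), dtype=np.complex64))},
    coords={
        "TIME": ("row", np.repeat([5077089530.0, 5077089538.0], 3)),
        "ANTENNA1": ("row", np.array([0, 0, 1, 0, 0, 1])),
        "ANTENNA2": ("row", np.array([1, 2, 2, 1, 2, 2])),
        "CHAN_FREQ": ("chan", chan_freq),
        "CHAN_WIDTH": ("chan", np.full(nchan, chan_width)),
    },
)

# With reified coordinates the driver can reason about the grid in memory,
# e.g. derive row chunks that align with timestep boundaries ...
_, counts = np.unique(ds.TIME.values, return_counts=True)
row_chunks = tuple(counts)  # -> (3, 3)

# ... select complete timesteps without touching the bulk data ...
first_timestep = ds.isel(row=np.flatnonzero(ds.TIME.values == ds.TIME.values[0]))

# ... and check the channel grid for regularity.
regular = bool(np.allclose(np.diff(ds.CHAN_FREQ.values), ds.CHAN_WIDTH.values[:-1]))
```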

Compute chunk sizes from the indexing columns as a pre-compute step

We tend to do this in our applications in any case (@JSKenyon in Quartical, myself in xova, @landmanbester ?).

Pros:

  • Not memory-intensive

Cons:

  • Read the indexing columns twice just to get properly chunked dask arrays.
  • This may become more problematic as data sizes increase.
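
For reference, a minimal sketch of this two-pass pattern, assuming daskms.xds_from_ms accepts a chunks mapping with a "row" entry, that grouping on DATA_DESC_ID yields a single dataset here, and that the path and the 100k row budget are purely illustrative:

```python
# Sketch of the "pre-compute" pattern: read the indexing columns once to
# derive row chunks, then re-open the MS with those chunks.
# Assumes a single resulting dataset and that TIME is monotonic within it.
import numpy as np
from daskms import xds_from_ms

# First pass: only the TIME indexing column.
(index_ds,) = xds_from_ms("obs.ms", columns=["TIME"],
                          group_cols=["DATA_DESC_ID"])
_, counts = np.unique(index_ds.TIME.data.compute(), return_counts=True)

# Accumulate complete timesteps into row chunks of roughly 100k rows.
row_chunks, rows = [], 0
for c in counts:
    if rows and rows + c > 100_000:
        row_chunks.append(rows)
        rows = 0
    rows += c
row_chunks.append(rows)

# Second pass: the indexing columns are read again, but the resulting
# dask arrays now have well-defined, timestep-aligned chunks.
datasets = xds_from_ms("obs.ms", group_cols=["DATA_DESC_ID"],
                       chunks={"row": tuple(row_chunks)})
```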

landmanbester (Collaborator) commented:

> We tend to do this in our applications in any case (@JSKenyon in Quartical, myself in xova, @landmanbester ?).

Yup, I read the indexing columns to set things up as well. It would be nice to have them reified. Maybe don't go as far as reifying UVWs, but having access to time and frequency without calling compute is very handy.

JSKenyon (Collaborator) commented Sep 5, 2023

> I've been thinking a bit about xarray as a first class dependency and the fact that xarray doesn't support nan chunks.

Wholeheartedly agree with making it a first class dependency. The nan chunk problem is irritating but as you mentioned, there are ways around it.

> In the case of standard averaging this will require reification of TIME, INTERVAL, ANTENNA1 and ANTENNA2 columns (and probably CHAN_FREQ and CHAN_WIDTH too).

This is certainly possible. As you say, it will make things more memory intensive in the driver, but it also gives us access to a fair amount of powerful xarray functionality.

> In the more complex BDA case this requires reification of TIME, INTERVAL, ANTENNA1, ANTENNA2, UVW, CHAN_WIDTH and CHAN_FREQ.

I think if we add this functionality, we should do it in such a way that a user may optionally specify which elements are coordinates and which ones should be read/exposed. That way you don't get reified UVW unless you need it.

> • [MSv2.0 indexing columns](https://casa.nrao.edu/Memos/229.html#SECTION00061000000000000000)
> • `SPECTRAL_WINDOW` `CHAN_FREQ` and `CHAN_WIDTH`
> • `POLARIZATION` `CORR_PRODUCT`
> • Perhaps some relevant `FEED` columns.

I would say that I want to seriously discourage bolting every indexing column into the coordinates. I think we need some sort of mechanism for getting what we need in an application-specific way. That said, even with all of the above, it is still probably manageable for current problem sizes (although I agree that it could become a problem). Some of the above could be compressed substantially in memory if there were run-length encoding (RLE) or something similar in the xarray coordinates (not sure if this exists/could exist).

> Pros:
>
> • We can reason about the grid in-memory, and produce accurately chunked dask arrays
> • Following on from the point above, it would also give us the ability to check for regularity in the grid

Just reiterating that this gives us a great deal more power in xarray land.

> Cons:
>
> • The memory overhead of fully reified columns in the driver program, which will increase as dump rates and antenna counts do.
> • For the BDA case, we'd need to reify `UVW` coordinates, further compounding driver program memory costs. These are 3× the size of TIME, but it would still be preferable not to reify them.

There will still be a disconnect in the chunking problem, i.e. users would still need to use the time and channel information to decide on appropriate chunks and then do another xds_from_* call to produce the relevant dask arrays.

> Compute chunk sizes from the indexing columns as a pre-compute step
>
> We tend to do this in our applications in any case (@JSKenyon in Quartical, myself in xova, @landmanbester ?).

> Pros:
>
> • Not memory-intensive
>
> Cons:
>
> • Read the indexing columns twice just to get properly chunked dask arrays.
> • This may become more problematic as data sizes increase.

I don't really see the indexing columns ever becoming a problem relative to the actual data. I guess one can argue that the driver may end up with a fair amount of data in memory when doing graph construction.

Apologies, I think that this may be a difficult reply to read due to all the quoting. I do think that there are two separate problems to consider here.

The first is that we currently limit xarray functionality due to the absence of true coordinate values. This could be solved by allowing users to optionally request certain data_vars (columns) as coordinates during the read. The tricky part is how we attach those coordinates to the other data_vars e.g. if we require both CHAN_FREQ and CHAN_WIDTH, which one is the proper coordinate? We could also opt to establish a CHAN_ID which would be a global integer index over all channels in a spectral window (or optionally over all spectral windows?). Then each CHAN_ID could be associated with a CHAN_FREQ and CHAN_WIDTH. This would pave the way for more easily splitting and combining datasets in frequency.
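
A small sketch of the CHAN_ID idea in plain xarray (the coordinate does not exist in dask-ms today; names and values are illustrative): a global integer channel index makes splitting and recombining datasets in frequency unambiguous.

```python
# Hypothetical CHAN_ID coordinate: a global integer channel index per
# spectral window, with CHAN_FREQ associated to it. Values are illustrative.
import numpy as np
import xarray as xr

nchan = 8
chan_freq = 856e6 + 106750.0 * np.arange(nchan)

full = xr.Dataset(
    {"DATA": (("row", "chan"), np.zeros((4, nchan), dtype=np.complex64))},
    coords={
        "CHAN_ID": ("chan", np.arange(nchan)),
        "CHAN_FREQ": ("chan", chan_freq),
    },
)

# Split the dataset in frequency ...
low, high = full.isel(chan=slice(0, 4)), full.isel(chan=slice(4, None))

# ... and recombine it later; CHAN_ID makes the ordering unambiguous.
recombined = xr.concat([low, high], dim="chan").sortby("CHAN_ID")
assert bool((recombined.CHAN_ID.values == full.CHAN_ID.values).all())
```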

The second problem is the chunking. The reason that many of our applications have this two-step process is that dask-ms is not currently aware of the way in which we typically chunk our data, i.e. in blocks of complete times and channels. If dask-ms had a more elaborate chunking mechanism, it may be possible to skip this step in user code, e.g. xds_from_* could support passing a function and column names to the chunks argument so that the desired chunking is ensured on the first read.
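
A purely hypothetical sketch of what such an interface might look like; the chunks_from keyword and the callback signature are invented for illustration and do not exist in dask-ms:

```python
# Hypothetical interface (not in dask-ms): the user names the indexing
# columns to pre-read and supplies a callback that maps them to row chunks,
# letting the library ensure the chunking on the first read.
import numpy as np

def complete_timestep_chunks(time, max_rows=100_000):
    """Group rows into chunks that never split a timestep."""
    _, counts = np.unique(time, return_counts=True)
    chunks, rows = [], 0
    for c in counts:
        if rows and rows + c > max_rows:
            chunks.append(rows)
            rows = 0
        rows += c
    chunks.append(rows)
    return tuple(chunks)

# Hypothetical usage, were such an argument to exist:
# datasets = xds_from_ms("obs.ms",
#                        chunks_from=(["TIME"], complete_timestep_chunks))
```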

JSKenyon (Collaborator) commented:

Another point in favour of reified coordinates/extending the ROWID behaviour to other common axes is that we could support partial updates (see #93). This would make writing data after selection much easier (at present it requires a fair amount of wrangling downstream).
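
A tiny sketch (plain xarray; the write step itself is omitted, since partial updates are exactly what #93 asks for) of how a retained ROWID coordinate would identify the rows a partial write should target:

```python
# Illustrative only: reified ROWID (and, by extension, CHAN_ID-style
# coordinates) identify the destination of a partial update after selection.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"FLAG": (("row", "chan"), np.zeros((6, 4), dtype=bool))},
    coords={"ROWID": ("row", np.arange(6)),
            "ANTENNA1": ("row", np.array([0, 0, 1, 0, 0, 1]))},
)

# Select a subset (e.g. one antenna); ROWID survives the selection.
sel = ds.isel(row=np.flatnonzero(ds.ANTENNA1.values == 0))
sel["FLAG"].values[:] = True

# A writer could use sel.ROWID (here rows 0, 1, 3, 4) to update only those
# MS rows, rather than requiring a write of the full column.
print(sel.ROWID.values)
```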
