
Support optional CASA columns #270

Merged: 13 commits into master on Nov 11, 2022

Conversation

@sjperkins (Member) commented Nov 9, 2022

  • Closes #271: MS descriptors don't support optional columns

  • Tests added / passed

    $ py.test -v -s daskms/tests

    If the pep8 tests fail, the quickest way to correct
    this is to run autopep8 and then flake8 and
    pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    pip install -r requirements.readthedocs.txt
    cd docs
    READTHEDOCS=True make html
    

@sjperkins (Member, Author) commented Nov 9, 2022

@miguelcarcamov Thanks for the Measurement Set provided in #271 (comment). It's a good example of the use of optional CASA columns and subtables within a Measurement Set and really exercises the conversion functionality.

Initially, the conversion process was falling over on the POINTING subtable due to an attempt to add the optional OVER_THE_TOP column. This PR has modified dask-ms to support these optional columns when creating new Measurement Sets, i.e. during dask-ms convert.
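To illustrate what an optional column is at the python-casacore level (a rough sketch only, not the code path this PR adds to dask-ms, and output.ms is a hypothetical path), adding the optional boolean OVER_THE_TOP column to an existing POINTING subtable looks roughly like:

from casacore.tables import table, maketabdesc, makescacoldesc

# open the POINTING subtable of a hypothetical output.ms for writing
pointing = table("output.ms::POINTING", readonly=False)
if "OVER_THE_TOP" not in pointing.colnames():
    # the default value (False) fixes the column's Bool type
    pointing.addcols(maketabdesc(makescacoldesc("OVER_THE_TOP", False)))
pointing.close()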

However, there are other, arguably non-standard sub-tables associated with this Measurement Set that are challenging for dask-ms to represent. For example, the ASDM_SOURCE sub-table has a frequency column whose values are variably shaped or missing entirely. See the example below:

In [4]: A = pt.table("/home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE")
Successful readonly open of default-locked table /home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE: 32 columns, 126 rows

In [5]: A.getcol("frequency")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [5], line 1
----> 1 A.getcol("frequency")

File ~/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.8/lib/python3.8/site-packages/casacore/tables/table.py:1032, in table.getcol(self, columnname, startrow, nrow, rowincr)
   1012 """Get the contents of a column or part of it.
   1013
   1014 It is returned as a numpy array.
   (...)
   1021
   1022 """
   1023 #        try:     # trial code to read using a vector of rownrs
   1024 #            nr = len(startrow)
   1025 #            if nrow < 0:
   (...)
   1030 #                i = inx*
   1031 #        except:
-> 1032 return self._getcol(columnname, startrow, nrow, rowincr)

RuntimeError: Table DataManager error: Internal error: StManIndArray::get/put shapes not conforming

In [6]: A.getcell("frequency", 64)
Out[6]:
array([1.49896229e+09, 4.99654097e+09, 8.10249886e+09, 1.49896229e+10,
       4.28274940e+10, 9.77000000e+10, 1.09800000e+11, 4.85000000e+09,
       9.14600000e+10, 1.03490000e+11])

In [7]: A.getcell("frequency", 0)
Out[7]: array([2.24e+11])

In [8]: A.getcell("frequency", 125)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [8], line 1
----> 1 A.getcell("frequency", 125)

File ~/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.8/lib/python3.8/site-packages/casacore/tables/table.py:959, in table.getcell(self, columnname, rownr)
    952 def getcell(self, columnname, rownr):
    953     """Get data from a column cell.
    954
    955     Get the contents of a cell which can be returned as a scalar value,
    956     a numpy array, or a dict depending on the contents of the cell.
    957
    958     """
--> 959     return self._getcell(columnname, rownr)

RuntimeError: Table DataManager error: Invalid operation: SSMIndColumn::getShape: no array in row 125 in column frequency of table /home/simon/data/HLTau_B6cont.calavg.tav300s/ASDM_SOURCE

This sort of variably-shaped data is difficult to represent with dask arrays, and it would be a lot of extra effort to support it in dask-ms for (in my mind) small reward.
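For what it's worth, the cells are still reachable one row at a time. A minimal sketch (assuming python-casacore and the ASDM_SOURCE path from the transcript above) that tolerates both the variable shapes and the rows with no array:

import casacore.tables as pt

A = pt.table("HLTau_B6cont.calavg.tav300s::ASDM_SOURCE")
freqs = []
for row in range(A.nrows()):
    try:
        # each cell is an independent, variable-length array
        freqs.append(A.getcell("frequency", row))
    except RuntimeError:
        # some rows (e.g. row 125 above) contain no array at all
        freqs.append(None)
A.close()

The result is a ragged list with holes, which has no natural dask array representation.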

Do you strictly need the data in ASDM_SOURCE?

I'm inclined to add the ability to exclude sub-tables during the conversion process (we already ignore the SOURCE and SORTED_TABLE sub-tables for similar reasons).

What are your thoughts on this?

@miguelcarcamov commented

What are your thoughts on this?

Hi @sjperkins, thanks for putting effort and time into this! Tbh, I'm not sure where these tables come from. From what I can see with casabrowser, they seem to come from the ALMA calibration process (that dataset is from ALMA). Having said this, at least for self-cal and imaging the data in the ASDM_SOURCE table is not important, so excluding it from the conversion process sounds good to me atm. I could also ask the ALMA helpdesk where these tables come from and where they are used, if you want me to :).

@sjperkins (Member, Author) commented

What are your thoughts on this?

Hi @sjperkins, thanks for putting effort and time into this! Tbh, I'm not sure where these tables come from. From what I can see with casabrowser, they seem to come from the ALMA calibration process (that dataset is from ALMA). Having said this, at least for self-cal and imaging the data in the ASDM_SOURCE table is not important, so excluding it from the conversion process sounds good to me atm. I could also ask the ALMA helpdesk where these tables come from and where they are used, if you want me to :).

It looks like they're related to the ALMA raw data format (ASDM). Quoting from CASA fundamentals:

The ALMA and VLA raw data format, however, is not the MS but the so-called Astronomy Science Data Model (ASDM), also referred to as the ALMA Science Data Model for ALMA, and the Science Data Model (SDM) for the VLA. The ALMA and VLA archives hence do not store data in MS format but in ASDM format, and when a CASA user starts to work with this data, the first step has to be the import of the ASDM into the CASA MS format.

I would speculate that the ASDM_* subtables are vestiges of the conversion from ASDM into the CTDS MS. I doubt that any software besides the ALMA pipeline would understand them, although please correct me if I'm wrong here.

The purpose of dask-ms convert is to convert MS-like data between the CASA Table Data System (CTDS) format and newer cloud-native formats like Zarr and Apache Arrow. MS-to-MS conversion is possible, but ultimately we think that moving to cloud-native formats is the way to go for the quantities of data that modern interferometers produce. In that sense, we don't want to spend time supporting custom sub-tables and CTDS formats.
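As a rough sketch of that direction (give or take the exact import paths, which should be treated as an assumption here), an MS-to-Zarr conversion through the Python API looks something like:

import dask
from daskms import xds_from_ms
from daskms.experimental.zarr import xds_to_zarr

# one lazy dataset per (FIELD_ID, DATA_DESC_ID) partition by default
datasets = xds_from_ms("HLTau_B6cont.calavg.tav300s")
# construct a lazy write graph into a Zarr store, then execute it
writes = xds_to_zarr(datasets, "output.zarr")
dask.compute(writes)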

I'll add a subtable exclusion flag to this PR.

@miguelcarcamov commented

Hi @sjperkins

I would speculate that the ASDM_* subtables are vestiges of the conversion from ASDM into the CTDS MS. I doubt that any software besides the ALMA pipeline would understand them, although please correct me if I'm wrong here.

I think you are correct on this.

The purpose of dask-ms convert is to convert MS-like data between the CASA Table Data System (CTDS) format and newer cloud-native formats like Zarr and Apache Arrow. MS-to-MS conversion is possible, but ultimately we think that moving to cloud-native formats is the way to go for the quantities of data that modern interferometers produce. In that sense, we don't want to spend time supporting custom sub-tables and CTDS formats.

That makes total sense to me. I think that's the right direction for dask-ms convert, and it makes sense not to spend time supporting custom sub-tables or CTDS formats. In fact, we got here because I wanted to re-order the Measurement Set to do self-cal and imaging. I'm glad we got to this point, because we realized that ALMA datasets have ASDM subtables that make conversion or re-ordering difficult. I guess that if I had tried the same conversion and reordering but converting to Zarr, I would have encountered the same error, right?

@sjperkins (Member, Author) commented

The following succeeds for me on this branch:

dask-ms convert ~/data/HLTau_B6cont.calavg.tav300s -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -i "ANTENNA1,ANTENNA2,TIME,FEED1,FEED2" -o /tmp/output.ms --format ms --force -x "ASDM_ANTENNA::*,ASDM_CALATMOSPHERE::*,ASDM_CALWVR::*,ASDM_RECEIVER::*,ASDM_SOURCE::*,ASDM_STATION::*"

@sjperkins (Member, Author) commented

I guess that if I had tried the same conversion and reordering but converting to Zarr, I would have encountered the same error, right?

Yes.

@sjperkins (Member, Author) commented

@miguelcarcamov Is HLTau_B6cont.calavg.tav300s freely available somewhere (on the NRAO website, for example)? I'm considering using it as the basis for a test suite for dask-ms convert and storing it in a public S3 bucket, but I'd like to be mindful of data ownership concerns.

@miguelcarcamov commented

@miguelcarcamov Is HLTau_B6cont.calavg.tav300s freely available somewhere (on the NRAO website, for example)? I'm considering using it as the basis for a test suite for dask-ms convert and storing it in a public S3 bucket, but I'd like to be mindful of data ownership concerns.

@sjperkins Yes, that dataset and others from ALMA are freely available here. You can find it as HLTau Band 6. Although mine is not strictly the same, because I applied the self-calibration script and then a time average to it, the dataset is freely available and there should be no problem with you using it :)

@miguelcarcamov commented

I have tried the command with this branch and I confirm that it works. However, I'm a little bit concerned about the processing time, which for me was 21m 43s :(

@sjperkins (Member, Author) commented

I have tried the command with this branch and I confirm that it works. However, I'm a little bit concerned about the processing time, which for me was 21m 43s :(

Thanks for confirming. Unfortunately, there's not much that can be done about the processing time -- due to the indexing, CASA is reading non-contiguous data on disk. To get better performance, I think a proper distributed shuffle would be required.

@JSKenyon (Collaborator) commented

Out of interest, how large was the dataset in question @miguelcarcamov?

@sjperkins (Member, Author) commented

Out of interest, how large was the dataset in question @miguelcarcamov?

$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/

@JSKenyon (Collaborator) commented

$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/

Hmmm, interesting. I wonder if there would be anything to be gained from doing this on a per-spw, per-field, per-scan basis, i.e. a bit like tricolour. So rather than reading random rows, read a contiguous chunk and write it out to a per-baseline format. It is a bit special-casey though. I agree with you that there isn't a great way around the problem (without a special-case implementation).

@sjperkins (Member, Author) commented

$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/

Hmmm, interesting. I wonder if there would be anything to be gained from doing this on a per-spw, per-field, per-scan basis, i.e. a bit like tricolour. So rather than reading random rows, read a contiguous chunk and write it out to a per-baseline format. It is a bit special-casey though. I agree with you that there isn't a great way around the problem (without a special-case implementation).

That entails assumptions about the size of the grouping. IIRC a 32K MeerKAT scan is ~100GB, which is doable on an HPC node.

@sjperkins sjperkins merged commit 482873a into master Nov 11, 2022
@sjperkins sjperkins deleted the support-optional-casa-columns branch November 11, 2022 17:32
@miguelcarcamov commented

I was thinking about two things after realizing how much time it takes dask-ms convert to re-order a 161 MB dataset.

  1. Given the time it takes, re-ordering from "TIME, ANTENNA1, ANTENNA2" to "ANTENNA1, ANTENNA2, TIME" to do self-cal would take forever on each self-cal iteration. Maybe it's better to use only the first ordering for both. But that brings me to the question of what happens with a really big dataset: it would take forever to re-order and then do the self-cal.

  2. Would it be faster if, instead of converting to MS, it were converted to Zarr? (I don't think so; it should take the same time.)

@sjperkins (Member, Author) commented

I have tried the command with this branch and I confirm that it works. However, I'm a little bit concerned about the processing time, which for me was 21m 43s :(

Interesting, I get the following on an SSD:

time dask-ms convert ~/data/HLTau_B6cont.calavg.tav300s -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -i "ANTENNA1,ANTENNA2,TIME,FEED1,FEED2" -o /tmp/output.ms --format ms --force -x "ASDM_ANTENNA::*,ASDM_CALATMOSPHERE::*,ASDM_CALWVR::*,ASDM_RECEIVER::*,ASDM_SOURCE::*,ASDM_STATION::*"

real    3m33.486s
user    1m37.530s
sys     0m23.161s

@sjperkins (Member, Author) commented Nov 14, 2022

I was thinking about two things after realizing how much time it takes dask-ms convert to re-order a 161 MB dataset.

  1. Given the time it takes, re-ordering from "TIME, ANTENNA1, ANTENNA2" to "ANTENNA1, ANTENNA2, TIME" to do self-cal would take forever on each self-cal iteration. Maybe it's better to use only the first ordering for both. But that brings me to the question of what happens with a really big dataset: it would take forever to re-order and then do the self-cal.
  2. Would it be faster if, instead of converting to MS, it were converted to Zarr? (I don't think so; it should take the same time.)

There currently isn't any SQL support for the Zarr (or Arrow) backend in dask-ms, and therefore non-contiguous disk access patterns haven't been revealed yet. Having said that, there are many SQL engines available for Apache Arrow.

I've alluded to this previously: #159 (comment), but broadly speaking there are three options available:

  1. Rely on the SQL engine to return ordered data, hopefully accessing non-contiguous data in as optimal a pattern as possible.
  2. Implement data transposes (in-memory or on-disk) at fairly coarse granularity (i.e. FIELD_ID, DATA_DESC_ID, SCAN_NUMBER). We achieve this in tricolour with some dask, for instance.
  3. Rely on the execution engine to perform a distributed shuffle.

So there's a tension between relying on the underlying storage engine and relying on the execution engine. (2) is a compromise between them, for example.
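To make (3) a little more concrete, the operation involved is the kind of full cross-partition shuffle that dask.dataframe performs in set_index. This toy example (a stand-in for MAIN table index columns, not an existing dask-ms feature) illustrates the mechanism:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# toy stand-in for time-major MAIN table index columns
df = pd.DataFrame({
    "TIME": np.repeat(np.arange(5.0), 6),
    "ANTENNA1": np.tile([0, 0, 0, 1, 1, 2], 5),
    "ANTENNA2": np.tile([1, 2, 3, 2, 3, 3], 5),
})
ddf = dd.from_pandas(df, npartitions=4)

# set_index shuffles rows between partitions so the result is
# partitioned and sorted by the new index (antenna-major here);
# a full ANTENNA1, ANTENNA2, TIME order would need a compound key
shuffled = ddf.set_index("ANTENNA1").compute()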
