
Support optional CASA columns #270

Merged: 13 commits into master on Nov 11, 2022

Conversation

@sjperkins (Member) commented Nov 9, 2022

  • Closes #271: MS descriptors don't support optional columns

  • Tests added / passed

    $ py.test -v -s daskms/tests

    If the pep8 tests fail, the quickest way to correct
    this is to run autopep8 and then flake8 and
    pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    pip install -r requirements.readthedocs.txt
    cd docs
    READTHEDOCS=True make html
    

@sjperkins (Member, Author) commented Nov 9, 2022

@miguelcarcamov Thanks for the Measurement Set provided in #271 (comment). It's a good example of the use of optional CASA columns and subtables within a Measurement Set and really exercises the conversion functionality.

Initially, the conversion process was falling over on the POINTING subtable due to an attempt to add the optional OVER_THE_TOP column. This PR has modified dask-ms to support these optional columns when creating new Measurement Sets, i.e. during dask-ms convert.
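To illustrate what an optional column is at the python-casacore level (a rough sketch only, not the code path this PR adds to dask-ms, and output.ms is a hypothetical path), adding the optional boolean OVER_THE_TOP column to an existing POINTING subtable looks roughly like:

from casacore.tables import table, maketabdesc, makescacoldesc

# open the POINTING subtable of a hypothetical output.ms for writing
pointing = table("output.ms::POINTING", readonly=False)
if "OVER_THE_TOP" not in pointing.colnames():
    # the default value (False) fixes the column's Bool type
    pointing.addcols(maketabdesc(makescacoldesc("OVER_THE_TOP", False)))
pointing.close()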

However, there are other, arguably non-standard sub-tables associated with this Measurement Set that are challenging for dask-ms to represent. For example, the ASDM_SOURCE sub-table has a frequency column whose values are variably shaped or missing entirely. See the example below:

In [4]: A = pt.table("/home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE")
Successful readonly open of default-locked table /home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE: 32 columns, 126 rows

In [5]: A.getcol("frequency")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [5], line 1
----> 1 A.getcol("frequency")

File ~/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.8/lib/python3.8/site-packages/casacore/tables/table.py:1032, in table.getcol(self, columnname, startrow, nrow, rowincr)
   1012 """Get the contents of a column or part of it.
   1013
   1014 It is returned as a numpy array.
   (...)
   1021
   1022 """
   1023 #        try:     # trial code to read using a vector of rownrs
   1024 #            nr = len(startrow)
   1025 #            if nrow < 0:
   (...)
   1030 #                i = inx*
   1031 #        except:
-> 1032 return self._getcol(columnname, startrow, nrow, rowincr)

RuntimeError: Table DataManager error: Internal error: StManIndArray::get/put shapes not conforming

In [6]: A.getcell("frequency", 64)
Out[6]:
array([1.49896229e+09, 4.99654097e+09, 8.10249886e+09, 1.49896229e+10,
       4.28274940e+10, 9.77000000e+10, 1.09800000e+11, 4.85000000e+09,
       9.14600000e+10, 1.03490000e+11])

In [7]: A.getcell("frequency", 0)
Out[7]: array([2.24e+11])

In [8]: A.getcell("frequency", 125)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [8], line 1
----> 1 A.getcell("frequency", 125)

File ~/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.8/lib/python3.8/site-packages/casacore/tables/table.py:959, in table.getcell(self, columnname, rownr)
    952 def getcell(self, columnname, rownr):
    953     """Get data from a column cell.
    954
    955     Get the contents of a cell which can be returned as a scalar value,
    956     a numpy array, or a dict depending on the contents of the cell.
    957
    958     """
--> 959     return self._getcell(columnname, rownr)

RuntimeError: Table DataManager error: Invalid operation: SSMIndColumn::getShape: no array in row 125 in column frequency of table /home/simon/data/HLTau_B6cont.calavg.tav300s/ASDM_SOURCE

This sort of variably-shaped data is difficult to represent with dask arrays, and it would be a lot of extra effort to support it in dask-ms for (in my mind) small reward.
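For what it's worth, the cells are still reachable one row at a time. A minimal sketch (assuming python-casacore and the ASDM_SOURCE path from the transcript above) that tolerates both the variable shapes and the rows with no array:

import casacore.tables as pt

A = pt.table("HLTau_B6cont.calavg.tav300s::ASDM_SOURCE")
freqs = []
for row in range(A.nrows()):
    try:
        # each cell is an independent, variable-length array
        freqs.append(A.getcell("frequency", row))
    except RuntimeError:
        # some rows (e.g. row 125 above) contain no array at all
        freqs.append(None)
A.close()

The result is a ragged list with holes, which has no natural dask array representation.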

Do you strictly need the data in ASDM_SOURCE?

I'm inclined to add the ability to exclude sub-tables during the conversion process (we already ignore the SOURCE and SORTED_TABLE sub-tables for similar reasons).

What are your thoughts on this?

@miguelcarcamov commented

What are your thoughts on this?

Hi @sjperkins, thanks for putting effort and time into this! Tbh, I'm not sure where these tables come from. From what I can see with casabrowser, they seem to come from the ALMA calibration process (that dataset is from ALMA). Having said this, at least for self-cal and imaging the data in the ASDM_SOURCE table is not important, so excluding it from the conversion process sounds good to me atm. I could also ask the ALMA helpdesk where these tables come from and where they are used, if you want me to :).

@sjperkins (Member, Author) commented

What are your thoughts on this?

Hi @sjperkins, thanks for putting effort and time into this! Tbh, I'm not sure where these tables come from. From what I can see with casabrowser, they seem to come from the ALMA calibration process (that dataset is from ALMA). Having said this, at least for self-cal and imaging the data in the ASDM_SOURCE table is not important, so excluding it from the conversion process sounds good to me atm. I could also ask the ALMA helpdesk where these tables come from and where they are used, if you want me to :).

It looks like they're related to the ALMA raw data format (ASDM). Quoting from CASA fundamentals:

The ALMA and VLA raw data format, however, is not the MS but the so-called Astronomy Science Data Model (ASDM), also referred to as the ALMA Science Data Model for ALMA, and the Science Data Model (SDM) for the VLA. The ALMA and VLA archives hence do not store data in MS format but in ASDM format, and when a CASA user starts to work with this data, the first step has to be the import of the ASDM into the CASA MS format.

I would speculate that the ASDM_* subtables are vestiges of the conversion from ASDM into the CTDS MS. I doubt that any software besides the ALMA pipeline would understand them, although please correct me if I'm wrong here.

The purpose of dask-ms convert is to convert MS-like data between the CASA Table Data System (CTDS) format and newer cloud-native formats like Zarr and Apache Arrow. MS-to-MS conversion is possible, but ultimately we think that moving to cloud-native formats is the way to go for the quantities of data that modern interferometers produce. In that sense, we don't want to spend time supporting custom sub-tables and CTDS formats.
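As a rough sketch of that direction (give or take the exact import paths, which should be treated as an assumption here), an MS-to-Zarr conversion through the Python API looks something like:

import dask
from daskms import xds_from_ms
from daskms.experimental.zarr import xds_to_zarr

# one lazy dataset per (FIELD_ID, DATA_DESC_ID) partition by default
datasets = xds_from_ms("HLTau_B6cont.calavg.tav300s")
# construct a lazy write graph into a Zarr store, then execute it
writes = xds_to_zarr(datasets, "output.zarr")
dask.compute(writes)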

I'll add a subtable exclusion flag to this PR.

@miguelcarcamov commented

Hi @sjperkins

I would speculate that the ASDM_* subtables are vestiges of the conversion from ASDM into the CTDS MS. I doubt that any software besides the ALMA pipeline would understand them, although please correct me if I'm wrong here.

I think you are correct on this.

The purpose of dask-ms convert is to convert MS-like data between the CASA Table Data System (CTDS) format and newer cloud-native formats like Zarr and Apache Arrow. MS-to-MS conversion is possible, but ultimately we think that moving to cloud-native formats is the way to go for the quantities of data that modern interferometers produce. In that sense, we don't want to spend time supporting custom sub-tables and CTDS formats.

That makes total sense to me. I think that's the right direction for dask-ms convert, and it makes sense not to spend time supporting custom sub-tables or CTDS formats. In fact, we got here because I wanted to re-order the Measurement Set to do self-cal and imaging. I'm glad we got to this point, because we realized that ALMA datasets have ASDM subtables that make conversion or re-ordering difficult. I guess that if I had tried the same conversion and reordering but converting to Zarr, I would have encountered the same error, right?

@sjperkins (Member, Author) commented

The following succeeds for me on this branch:

dask-ms convert ~/data/HLTau_B6cont.calavg.tav300s -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -i "ANTENNA1,ANTENNA2,TIME,FEED1,FEED2" -o /tmp/output.ms --format ms --force -x "ASDM_ANTENNA::*,ASDM_CALATMOSPHERE::*,ASDM_CALWVR::*,ASDM_RECEIVER::*,ASDM_SOURCE::*,ASDM_STATION::*"

@sjperkins (Member, Author) commented

I guess that if I had tried the same conversion and reordering but converting to Zarr, I would have encountered the same error, right?

Yes.

@sjperkins (Member, Author) commented

@miguelcarcamov Is HLTau_B6cont.calavg.tav300s freely available somewhere (on the NRAO website, for example)? I'm considering using it as the basis for a test suite for dask-ms convert and storing it in a public S3 bucket, but I'd like to be mindful of data ownership concerns.

@miguelcarcamov commented

@miguelcarcamov Is HLTau_B6cont.calavg.tav300s freely available somewhere (on the NRAO website, for example)? I'm considering using it as the basis for a test suite for dask-ms convert and storing it in a public S3 bucket, but I'd like to be mindful of data ownership concerns.

@sjperkins Yes, that dataset and others from ALMA are freely available here. You can find it as HLTau Band 6. Although mine is not strictly the same, because I applied the self-calibration script and then a time average to it, the dataset is freely available and there should be no problem with you using it :)

@miguelcarcamov commented

I have tried the command with this branch and I confirm that it works. However, I'm a little bit concerned about the processing time, which for me was 21m 43s :(

@sjperkins (Member, Author) commented

I have tried the command with this branch and I confirm that it works. However, I'm a little bit concerned about the processing time, which for me was 21m 43s :(

Thanks for confirming. Unfortunately, there's not much that can be done about the processing time -- due to the indexing, CASA is reading non-contiguous data on disk. To get better performance, I think a proper distributed shuffle would be required.

@JSKenyon (Collaborator) commented

Out of interest, how large was the dataset in question @miguelcarcamov?

@sjperkins (Member, Author) commented

Out of interest, how large was the dataset in question @miguelcarcamov?

$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/

@JSKenyon (Collaborator) commented

$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/

Hmmm, interesting. I wonder if there would be anything to be gained from doing this on a per-spw, per-field, per-scan basis, i.e. a bit like tricolour. So rather than reading random rows, read a contiguous chunk and write it out to a per-baseline format. It is a bit special-casey though. I agree with you that there isn't a great way around the problem (without a special-case implementation).

@sjperkins (Member, Author) commented

$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/

Hmmm, interesting. I wonder if there would be anything to be gained from doing this on a per-spw, per-field, per-scan basis, i.e. a bit like tricolour. So rather than reading random rows, read a contiguous chunk and write it out to a per-baseline format. It is a bit special-casey though. I agree with you that there isn't a great way around the problem (without a special-case implementation).

That entails assumptions about the size of the grouping. IIRC a 32K MeerKAT scan is ~100GB, which is doable on an HPC node.

@sjperkins sjperkins merged commit 482873a into master Nov 11, 2022
@sjperkins sjperkins deleted the support-optional-casa-columns branch November 11, 2022 17:32
@miguelcarcamov commented

I was thinking about two things after realizing how much time it takes dask-ms convert to re-order a 161 MB dataset.

  1. Given the time it takes, re-ordering from "TIME, ANTENNA1, ANTENNA2" to "ANTENNA1, ANTENNA2, TIME" to do self-cal would take forever on each self-cal iteration. Maybe it's better to use only the first ordering for both. But that brings me to the question of what happens with a really big dataset: it would take forever to re-order and then do the self-cal.

  2. Would it be faster if, instead of converting to MS, it were converted to Zarr? (I don't think so; it should take the same time.)

@sjperkins (Member, Author) commented

I have tried the command with this branch and I confirm that it works. However, I'm a little bit concerned about the processing time, which for me was 21m 43s :(

Interesting, I get the following on an SSD:

time dask-ms convert ~/data/HLTau_B6cont.calavg.tav300s -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -i "ANTENNA1,ANTENNA2,TIME,FEED1,FEED2" -o /tmp/output.ms --format ms --force -x "ASDM_ANTENNA::*,ASDM_CALATMOSPHERE::*,ASDM_CALWVR::*,ASDM_RECEIVER::*,ASDM_SOURCE::*,ASDM_STATION::*"

real    3m33.486s
user    1m37.530s
sys     0m23.161s

@sjperkins (Member, Author) commented Nov 14, 2022

I was thinking about two things after realizing how much time it takes dask-ms convert to re-order a 161 MB dataset.

  1. Given the time it takes, re-ordering from "TIME, ANTENNA1, ANTENNA2" to "ANTENNA1, ANTENNA2, TIME" to do self-cal would take forever on each self-cal iteration. Maybe it's better to use only the first ordering for both. But that brings me to the question of what happens with a really big dataset: it would take forever to re-order and then do the self-cal.
  2. Would it be faster if, instead of converting to MS, it were converted to Zarr? (I don't think so; it should take the same time.)

There currently isn't any SQL support for the Zarr (or Arrow) backend in dask-ms, and therefore non-contiguous disk access patterns haven't been revealed yet. Having said that, there are many SQL engines available for Apache Arrow.

I've alluded to this previously: #159 (comment), but broadly speaking there are three options available:

  1. Rely on the SQL engine to return ordered data, hopefully accessing non-contiguous data in as optimal a pattern as possible.
  2. Implement data transposes (in-memory or on-disk) at fairly coarse granularity (i.e. FIELD_ID, DATA_DESC_ID, SCAN_NUMBER). We achieve this in tricolour with some dask, for instance.
  3. Rely on the execution engine to perform a distributed shuffle.

So there's a tension between relying on the underlying storage engine and relying on the execution engine. (2) is a compromise between them, for example.
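To make (3) a little more concrete, the operation involved is the kind of full cross-partition shuffle that dask.dataframe performs in set_index. This toy example (a stand-in for MAIN table index columns, not an existing dask-ms feature) illustrates the mechanism:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# toy stand-in for time-major MAIN table index columns
df = pd.DataFrame({
    "TIME": np.repeat(np.arange(5.0), 6),
    "ANTENNA1": np.tile([0, 0, 0, 1, 1, 2], 5),
    "ANTENNA2": np.tile([1, 2, 3, 2, 3, 3], 5),
})
ddf = dd.from_pandas(df, npartitions=4)

# set_index shuffles rows between partitions so the result is
# partitioned and sorted by the new index (antenna-major here);
# a full ANTENNA1, ANTENNA2, TIME order would need a compound key
shuffled = ddf.set_index("ANTENNA1").compute()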
