Support optional CASA columns #270
Conversation
@miguelcarcamov Thanks for the Measurement Set provided in #271 (comment). It's a good example of the use of optional CASA columns and subtables within a Measurement Set, and it really exercises the conversion functionality. Initially the conversion process was falling over on the POINTING subtable due to an attempt to add the optional OVER_THE_TOP column. This PR has modified dask-ms to support these optional columns when creating new Measurement Sets, i.e. during `dask-ms convert`.

However, there are other, arguably non-standard sub-tables associated with this Measurement Set that are challenging for dask-ms to represent. For example, the `ASDM_SOURCE` subtable contains variably-shaped rows:

```python
In [4]: A = pt.table("/home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE")
Successful readonly open of default-locked table /home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE: 32 columns, 126 rows

In [5]: A.getcol("frequency")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [5], line 1
----> 1 A.getcol("frequency")

File ~/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.8/lib/python3.8/site-packages/casacore/tables/table.py:1032, in table.getcol(self, columnname, startrow, nrow, rowincr)
   1012 """Get the contents of a column or part of it.
   1013
   1014 It is returned as a numpy array.
   (...)
   1021
   1022 """
   1023 # try: # trial code to read using a vector of rownrs
   1024 #   nr = len(startrow)
   1025 #   if nrow < 0:
   (...)
   1030 #       i = inx*
   1031 # except:
-> 1032 return self._getcol(columnname, startrow, nrow, rowincr)

RuntimeError: Table DataManager error: Internal error: StManIndArray::get/put shapes not conforming

In [6]: A.getcell("frequency", 64)
Out[6]:
array([1.49896229e+09, 4.99654097e+09, 8.10249886e+09, 1.49896229e+10,
       4.28274940e+10, 9.77000000e+10, 1.09800000e+11, 4.85000000e+09,
       9.14600000e+10, 1.03490000e+11])

In [7]: A.getcell("frequency", 0)
Out[7]: array([2.24e+11])

In [8]: A.getcell("frequency", 125)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [8], line 1
----> 1 A.getcell("frequency", 125)

File ~/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.8/lib/python3.8/site-packages/casacore/tables/table.py:959, in table.getcell(self, columnname, rownr)
    952 def getcell(self, columnname, rownr):
    953 """Get data from a column cell.
    954
    955 Get the contents of a cell which can be returned as a scalar value,
    956 a numpy array, or a dict depending on the contents of the cell.
    957
    958 """
--> 959 return self._getcell(columnname, rownr)

RuntimeError: Table DataManager error: Invalid operation: SSMIndColumn::getShape: no array in row 125 in column frequency of table /home/simon/data/HLTau_B6cont.calavg.tav300s/ASDM_SOURCE
```

This sort of variably-shaped data is difficult to represent with dask arrays, and it would be a lot of extra effort to support this type of data in dask-ms. Do you strictly need the data in these `ASDM_*` subtables?

I'm inclined to add the ability to exclude sub-tables during the conversion process (we already ignore some subtables). What are your thoughts on this?
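For reference, such a ragged column can still be read with python-casacore by falling back to per-cell access and skipping undefined cells. A minimal sketch (the `read_ragged_column` helper is hypothetical; the table path and column name are taken from the session above):

```python
import numpy as np
import casacore.tables as pt


def read_ragged_column(table_name, column):
    """Read a variably-shaped column cell by cell, skipping undefined cells."""
    T = pt.table(table_name, ack=False)
    try:
        cells = []
        for row in range(T.nrows()):
            # Rows like 125 above have no array stored at all
            if T.iscelldefined(column, row):
                cells.append(np.asarray(T.getcell(column, row)))
            else:
                cells.append(None)
        return cells
    finally:
        T.close()


freqs = read_ragged_column(
    "/home/simon/data/HLTau_B6cont.calavg.tav300s::ASDM_SOURCE", "frequency"
)
```

The result is a Python list of differently-shaped arrays (with `None` for empty rows), which is exactly the kind of structure dask arrays can't express directly.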
Hi @sjperkins, thanks for putting time and effort into this! Tbh, I'm not so sure where these tables come from. From what I can see with casabrowser, it seems to me that they come from the ALMA calibration process (that dataset is from ALMA). Having said this, at least for self-cal and imaging, the data in these `ASDM_*` subtables shouldn't be needed.
It looks like they're related to the ALMA raw data format (ASDM). Quoting from CASA fundamentals:
I would speculate that the `ASDM_*` subtables were carried over from the raw ASDM data when it was converted to a Measurement Set. I'll add a subtable exclusion flag to this PR.
Hi @sjperkins
I think you are correct on this.
That makes total sense to me; I think it's the right direction for dask-ms convert. It also makes sense not to spend time on supporting custom sub-tables or CTDS formats. In fact, we got here because I wanted to re-order the measurement set to do self-cal and imaging. I'm glad we got to this point, because we realized that ALMA datasets have ASDM subtables that make the conversion or re-ordering difficult. I guess that if I had tried the same conversion/re-ordering of the measurement set but converted it to Zarr, I would have encountered the same error, right?
The following succeeds for me on this branch:

```bash
dask-ms convert ~/data/HLTau_B6cont.calavg.tav300s \
  -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" \
  -i "ANTENNA1,ANTENNA2,TIME,FEED1,FEED2" \
  -o /tmp/output.ms --format ms --force \
  -x "ASDM_ANTENNA::*,ASDM_CALATMOSPHERE::*,ASDM_CALWVR::*,ASDM_RECEIVER::*,ASDM_SOURCE::*,ASDM_STATION::*"
```
Yes.
@miguelcarcamov Is this dataset publicly available?
@sjperkins Yes, that dataset and others from ALMA are freely available here. You can find it as HLTau Band 6. It's not strictly the same, because I applied the self-calibration script and then a time average to it, but the dataset is freely available and there shouldn't be any problem with you using it :)
I have tried the command on this branch and I confirm that it works. However, I'm a little concerned about the processing time, which for me was 21m 43s :(
Thanks for confirming. Unfortunately, there's not much that can be done about the processing time -- due to the indexing, CASA is reading non-contiguous data on disk. To get better performance, I think a proper distributed shuffle would be required.
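To illustrate the access pattern (a toy example with made-up arrays, not dask-ms internals): grouping and sorting on the index columns produces a row permutation that jumps around the on-disk row order, so each output partition requires scattered reads:

```python
import numpy as np

# Hypothetical index columns in on-disk (time-ordered) row order
ant1 = np.array([0, 1, 0, 1, 0, 1])
ant2 = np.array([1, 2, 1, 2, 1, 2])
time = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])

# Row order required to sort by (ANTENNA1, ANTENNA2, TIME):
# the permutation is non-contiguous w.r.t. on-disk row order
order = np.lexsort((time, ant2, ant1))
print(order)  # [0 2 4 1 3 5] -- strided reads, not sequential ones
```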
Out of interest, how large was the dataset in question, @miguelcarcamov?
```bash
$ du -hs HLTau_B6cont.calavg.tav300s/
161M    HLTau_B6cont.calavg.tav300s/
```
Hmmm, interesting. I wonder if there would be anything to be gained from doing this on a per-spw, per-field, per-scan basis, i.e. a bit like tricolour: rather than reading random rows, read a contiguous chunk and write it out to a per-baseline format. It is a bit special-casey, though. I agree with you that there isn't a great way around the problem (without a special-case implementation). A rough sketch of the per-group idea follows below.
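As a hypothetical illustration of that per-group approach (not what dask-ms convert currently does): python-casacore's `table.iter` yields one sub-table per unique combination of the given columns, so each group can be read in bulk, largely sequentially for time-ordered data, and rewritten in the desired order:

```python
import casacore.tables as pt

# Path is illustrative
T = pt.table("/path/to/input.ms", ack=False)
try:
    # One sub-table per (FIELD_ID, DATA_DESC_ID, SCAN_NUMBER) group;
    # for time-ordered data, each group's rows are close to contiguous on disk
    for group in T.iter(["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"]):
        data = group.getcol("DATA")  # one bulk, mostly-sequential read
        # ... re-order within the group and append to the output here ...
        group.close()
finally:
    T.close()
```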
That entails assumptions about the size of the grouping. IIRC it's ~100GB for a 32K MeerKAT scan, which is doable on an HPC node.
I was thinking about two things after realizing how much time it takes dask-ms convert to re-order a 161 MB dataset.
Interesting, I get the following on an SSD:

```bash
$ time dask-ms convert ~/data/HLTau_B6cont.calavg.tav300s -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -i "ANTENNA1,ANTENNA2,TIME,FEED1,FEED2" -o /tmp/output.ms --format ms --force -x "ASDM_ANTENNA::*,ASDM_CALATMOSPHERE::*,ASDM_CALWVR::*,ASDM_RECEIVER::*,ASDM_SOURCE::*,ASDM_STATION::*"

real    3m33.486s
user    1m37.530s
sys     0m23.161s
```
There currently isn't any SQL support for the Zarr (or Arrow) backend in dask-ms, and therefore non-contiguous disk access patterns haven't been revealed yet. Having said that, there are many SQL engines available for Apache Arrow. I've alluded to this previously in #159 (comment), but broadly speaking there are three options available:

So there's a tension between relying on the underlying storage engine and relying on the execution engine; (2) is a compromise between them, for example.
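As a concrete illustration of the SQL-on-Arrow point (DuckDB is just one example engine, and the schema below is made up, not dask-ms's): DuckDB can run SQL directly over an in-memory Arrow table via its replacement scans:

```python
import duckdb
import pyarrow as pa

# A toy Arrow table standing in for exported MS rows (illustrative schema)
rows = pa.table({
    "ANTENNA1": [0, 0, 1],
    "ANTENNA2": [1, 2, 2],
    "TIME": [0.0, 0.0, 1.0],
})

# DuckDB scans the Arrow table in place -- no copy into its own storage --
# and returns the result as Arrow again
result = duckdb.sql(
    "SELECT ANTENNA1, ANTENNA2, COUNT(*) AS n "
    "FROM rows GROUP BY ANTENNA1, ANTENNA2 ORDER BY 1, 2"
).arrow()
print(result)
```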
Closes #271 (MS descriptors don't support optional columns)
- Tests added / passed

  If the pep8 tests fail, the quickest way to correct this is to run `autopep8` and then `flake8` and `pycodestyle` to fix the remaining issues.

- Fully documented, including `HISTORY.rst` for all changes and one of the `docs/*-api.rst` files for new API.

  To build the docs locally: