Non-standard MS columns have an auto-generated schema which is not chunked according to logic for standard dimensions #241
Comments
Thanks for reporting. Can you run the command again within pdb as follows:

$ python -m pdb $(which dask-ms) convert ms1_primary.ms -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -o ms1_primary.zarr --chunks="{row:25000,chan:128}" --format zarr --force

and report on the dimensions of
This is probably because it gets created upfront. I'll bet that it's filled with zeros. What does
Is it possible that it is actually a subtable that is causing the problem? I don't recall how those are chunked (or, indeed, if they are left unchunked).
The chunks are as expected and DATA seems populated
I also suspect it may be one of the subtables, because it happens right at the end of a run. I am still waiting for it to fall over again so I can report the information you asked for, @sjperkins (unfortunately oates is acting up again and things are taking forever).
That'd be a really big subtable if a column has ~2GiB of data. Also, just spitballing some figures (complex64 == 8 bytes): 98332 x 4096 x 4 x 8 ~= 12GiB
Hmmmm, that is interesting...
They are left unchunked. One way of finding out if there are large subtables would be to do something like a:

$ du -hs ms1_primary.ms/
It does not seem that way
Actually this isn't quite true. A default chunking of 10,000 rows is applied.
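For reference, a minimal sketch of how that default can be overridden when reading, assuming the standard chunks keyword of xds_from_ms; the chunk sizes shown are illustrative only:

```python
# A sketch only: read the main table with explicit row chunks rather than
# relying on the 10,000-row default. Chunk sizes here are illustrative.
from daskms import xds_from_ms

datasets = xds_from_ms(
    "ms1_primary.ms",
    group_cols=["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"],
    chunks={"row": 25000},
)
print(datasets[0].chunks)  # mapping of dimension name -> chunk sizes
```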
OK, it must be one of the DATA columns then. If you have your initial attempt at writing the zarr dataset still lying around, can you do a:

from pprint import pprint
from daskms import xds_from_ms

datasets = xds_from_ms(...)
pprint(list(dict(ds.chunks) for ds in datasets))
Ah, I have some non-standard columns in there
Looks like the RESIDUAL column is written with frequency chunks of 4096. This works at the outset because it can be compressed?
Ah, could this be a schema thing? A column not in the default schema will not know about the

Edit: Posted before I saw the above. I think this is definitely the root cause.
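For context, dask-ms allows a per-column schema to be supplied when reading; a hedged sketch of what that could look like for a DATA-shaped extra column (the column name and the table_schema contents here are assumptions for illustration, not something tested against this MS):

```python
# A sketch only: extend the default Measurement Set schema so that a
# DATA-shaped extra column (RESIDUAL, as an assumed example) is labelled
# with the chan and corr dimensions instead of auto-generated ones.
from daskms import xds_from_ms

datasets = xds_from_ms(
    "ms1_primary.ms",
    table_schema=["MS", {"RESIDUAL": {"dims": ("chan", "corr")}}],
)
```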
Ah yes
@JSKenyon and I just discussed this in a meeting. The problem here is that there are non-standard columns in the MS. As it stands
My 2 cents:
A possible alternative (albeit not a very clean one) would be to check if unknown dimensions match existing dimensions in the MS and then chunk them the same (e.g. in the above case RESIDUAL-1 matches 'chan' along axis 1 and could be chunked the same). I suspect this will work 99% of the time (see the sketch below). Maybe print a warning if this is the case, throw an error if any non-standard columns don't match any existing dimensions, and resort to 1).
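Roughly, that size-matching heuristic could look like the sketch below; this is a standalone illustration under assumed dimension names and sizes, not existing dask-ms behaviour:

```python
# A sketch only: map an unknown column dimension to a known MS dimension
# when the sizes agree, otherwise fall back to an auto-generated name.
import warnings

def guess_dims(col_name, col_shape, known_dims):
    """col_shape excludes the row axis; known_dims maps dimension name -> size."""
    dims = []
    for axis, size in enumerate(col_shape):
        matches = [name for name, n in known_dims.items() if n == size]
        if matches:
            dims.append(matches[0])
            warnings.warn(f"Guessed {col_name} axis {axis} as '{matches[0]}'")
        else:
            # fall back to an auto-generated, column-specific dimension name
            dims.append(f"{col_name}-{axis + 1}")
    return tuple(dims)

# e.g. RESIDUAL with non-row shape (4096, 4) against chan=4096 and corr=4
print(guess_dims("RESIDUAL", (4096, 4), {"chan": 4096, "corr": 4}))
```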
This may be the best option as it can be detected quickly. That said, this could get very unwieldy if an MS has lots of non-standard columns.
It is true that this wouldn't fix the problem if the column was written by software not using dask-ms. However, it might be a decent 90% solution, with the remaining 10% solved by option 1.
This is possible, but it can be slightly brittle. For the MAIN table it is plausible, as we could do as you suggest, with a dim priority in the event that there are dims of the same size. My suggested priority would be

Unfortunately, none of the above succeeds in completely hiding this from the user, although option 2 will come close for our software, e.g. QuartiCal and pfb-clean. I think that adding non-standard columns is relatively unusual in the legacy stack (outside of CubiCal etc).

Finally, all of the above is only true for the main table. Subtables are probably even tougher to deal with, as they each have different dims. On top of that, we need to remember that
How about applying chunking heuristics to DATA-like columns only (a rough sketch follows below)? Can you think of any other column "schemas" that QuartiCal/pfb-clean/CubiCal use?
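A narrower version restricted to DATA-like columns might be as simple as the following sketch; again an assumed approach rather than existing dask-ms behaviour, and the column names and shapes are illustrative:

```python
# A sketch only: reuse DATA's (chan, corr) dims for columns whose non-row
# shape matches DATA exactly (e.g. RESIDUAL or BITFLAG).
def data_like_schema(data_shape, extra_columns):
    """data_shape excludes the row axis, e.g. (4096, 4) for (chan, corr)."""
    return {
        name: {"dims": ("chan", "corr")}
        for name, shape in extra_columns.items()
        if shape == data_shape
    }

# RESIDUAL matches DATA's shape and gets (chan, corr); WEIGHT does not
print(data_like_schema((4096, 4), {"RESIDUAL": (4096, 4), "WEIGHT": (4,)}))
```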
That will suffice for my purposes.
Only the WEIGHT column, but I believe that will be deprecated eventually and I am strongly opposed to using it anyway.
@Athanaseus just got hit by this again. Converting an MS with non-standard columns leaves the resulting dataset in a state that is hard to deal with, since it will not have the expected chan and corr dimensions for the non-standard columns.
This MS also has a BITFLAG column.
I should block off some time to look at this tomorrow. One possible workaround is to use the
Thanks @sjperkins. This is the column he wants to image. Actually he used the
Are these columns shaped like DATA/FLAG?
Description
I am trying to convert an MS to zarr, chunked by row and channel, and it's falling over with
ValueError: Codec does not support buffers of > 2147483647 bytes
despite the chunks only containing 25000 rows and 128 channels (around 25 MB by my count). Somewhat weirdly, I think the error is mostly harmless because it does produce a dataset that I can subsequently read. I have not checked whether all the subtables are what they should be, though.

What I Did
Here is the full output from convert
But I can still read the main table
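A hedged sketch of the kind of read-back check meant here, assuming the experimental zarr reader in dask-ms and an assumed store layout:

```python
# A sketch only: open the converted store and inspect the main table chunking.
# The store layout assumed here may differ from what convert actually writes.
from daskms.experimental.zarr import xds_from_zarr

datasets = xds_from_zarr("ms1_primary.zarr")
print(datasets[0].DATA.data.chunksize)
```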
so I am not sure what is happening here.