channel selection does not lead to anticipated speedup for MeerKAT MSs #31

Open · Labels: enhancement (New feature or request)

o-smirnov opened this issue Apr 19, 2020 · 8 comments
@o-smirnov
Collaborator

I've added a channel subset selector, which works in the usual slicing manner:

shadems ms-4k-cal.ms --col CORRECTED_DATA -x FREQ -y amp --corr XX,YY --color-by phase --cmin -2 --cmax 2 -z 10000 --chan 1000:1100
shadems ms-4k-cal.ms --col CORRECTED_DATA -x FREQ -y amp --corr XX,YY --color-by phase --cmin -2 --cmax 2 -z 10000 

The first case plots 100 channels, the second plots all 4096. However, the speedup in the first case is only about 2x, which suggests that not much I/O has been saved.

I apply the channel selection by slicing the DataArray in the group object, using array[dict(chan=chanslice)], which I thought was the prescribed manner to do this (@sjperkins please confirm).
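To illustrate what such a lazy slice does (a minimal sketch using plain dask arrays, with illustrative shapes standing in for the MS columns):

```python
import dask.array as da

# Toy stand-in for the visibility DataArray: each chunk spans the full
# channel axis, mimicking the MS layout discussed below.
nrow, nchan = 1000, 4096
vis = da.zeros((nrow, nchan), chunks=(100, nchan))

chanslice = slice(1000, 1100)
sub = vis[:, chanslice]          # lazy, like array[dict(chan=chanslice)]

# The slice is applied per chunk, but each source chunk still covers
# all 4096 channels, so the full chunk is read and then trimmed.
print(vis.chunks[1])  # (4096,)
print(sub.chunks[1])  # (100,)
```

The slicing itself is cheap; whether any I/O is saved depends on how the underlying chunks map onto disk reads.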

Perhaps the problem is with the DataManager in the MS. Looking at it:

>>> tab.getdminfo('CORRECTED_DATA')
{'TYPE': 'TiledShapeStMan', 'NAME': 'TiledCorrected', 'SEQNR': 24, 'SPEC': {'MaxCacheSize': 0, 'DEFAULTTILESHAPE': array([  4,  60, 546], dtype=int32), 'MAXIMUMCACHESIZE': 0, 'HYPERCUBES': {'*1': {'CubeShape': array([      4,    4096, 1234296], dtype=int32), 'TileShape': array([   4, 4096,    8], dtype=int32), 'CellShape': array([   4, 4096], dtype=int32), 'BucketSize': 1048576, 'ID': {}}}, 'SEQNR': 24, 'IndexSize': 1}}

...it doesn't tile the channel dimension. This means the underlying table system still reads entire rows, which is very inefficient if one only wants a small subset of the channels. In that case, this is a katdal issue to be fixed.
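A quick back-of-the-envelope check on the data manager info above bears this out (assuming complex64 visibilities at 8 bytes each):

```python
# TileShape from the HYPERCUBES entry in getdminfo() above
corr, chan, rows_per_tile = 4, 4096, 8
itemsize = 8  # complex64

bytes_per_tile = corr * chan * rows_per_tile * itemsize
print(bytes_per_tile)  # 1048576, matching the reported BucketSize

# Each tile spans the full 4096-channel axis, so any channel selection,
# however narrow, touches every tile covering the selected rows:
# reading 100 of 4096 channels pulls just as many tiles off disk as
# reading all of them.
```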

@sjperkins please also confirm -- if I'm slicing the array as above, it will eventually result in a getcolslice[np]() call to read the data and not getcol(), correct?

@sjperkins
Member

> I apply the channel selection by slicing the DataArray in the group object, using array[dict(chan=chanslice)], which I thought was the prescribed manner to do this (@sjperkins please confirm).

The following should do it: xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice})

> @sjperkins please also confirm -- if I'm slicing the array as above, it will eventually result in a getcolslicenp() call to read the data and not getcol(), correct?

Yes, but it's set up through xds_from_ms.

Re: katdal, I think channel tiling depends on the application (spectral, continuum, etc.), but let's chat in person.

@o-smirnov
Collaborator Author

BTW, @sjperkins, @JSKenyon, is the recent activity I saw on dask-ms related to getcolslicenp relevant here perhaps?

@o-smirnov
Collaborator Author

> The following should do it: xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice})

@sjperkins are you sure? Surely that's the chunking specification, not the channel subset specification. When I do it that way, it crashes with:

   File "/scratch/oms/projects/dask-ms/daskms/columns.py", line 219, in column_metadata
    dc = da.core.normalize_chunks(dc, shape=(s,))
  File "/home/oms/.venv/sms/lib/python3.6/site-packages/dask/array/core.py", line 2440, in normalize_chunks
    and len(chunks) > 1
TypeError: object of type 'slice' has no len()

Can't find it in the dask-ms docs...

@sjperkins
Member

> > The following should do it: xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice})
>
> @sjperkins are you sure? Surely that's the chunking specification, not the channel subset specification. When I do it that way, it crashes with:
>
>     File "/scratch/oms/projects/dask-ms/daskms/columns.py", line 219, in column_metadata
>       dc = da.core.normalize_chunks(dc, shape=(s,))
>     File "/home/oms/.venv/sms/lib/python3.6/site-packages/dask/array/core.py", line 2440, in normalize_chunks
>       and len(chunks) > 1
>     TypeError: object of type 'slice' has no len()
>
> Can't find it in the dask-ms docs...

Yes, it should be an integer or a tuple specifying the individual channel chunks (which should add up to the full channel range).

So for 60 channels, {'chan': 16} or {'chan': (16, 16, 16, 12)} will achieve the same thing.
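This equivalence can be checked directly with dask's normalize_chunks (the same helper that appears in the traceback above); the integer form is padded with the remainder:

```python
import dask.array as da

# An integer chunk size and an explicit tuple describe the same
# 60-channel axis; dask pads the final chunk with the remainder.
print(da.core.normalize_chunks(16, shape=(60,)))
print(da.core.normalize_chunks((16, 16, 16, 12), shape=(60,)))
# both give ((16, 16, 16, 12),)
```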

@sjperkins
Member

xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice.stop - chanslice.start})

@o-smirnov
Collaborator Author

But where is the actual channel subset specified, if I want one? I've been applying it as a slice to the DataArrays, is that the best way to do it?

@sjperkins
Member

sjperkins commented Apr 24, 2020

> But where is the actual channel subset specified, if I want one? I've been applying it as a slice to the DataArrays, is that the best way to do it?

Apologies, I'm losing the plot in this thread. You can slice the DataArrays, with the caveat that it'll read the entire channel range and then slice the relevant chunk out.

To do this with maximal efficiency, you'd have to run a pre-processing step to figure out the optimal chunking strategy for the channel dimension:

# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]
# Read the very small DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION").compute()
# Create a dataset per row of SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")
# Number of channels for each dataset
nchan = [spws[ddid[i].SPECTRAL_WINDOW_ID[0]].CHAN_FREQ.shape[0] for i in ddids]
# Channel chunking schema for each dataset: (before, selection, after)
# (note: slices use .stop; drop any zero-width edge chunks if the
# selection touches the start or end of the channel axis)
chan_chunks = [(chanslice.start,
                chanslice.stop - chanslice.start,
                nc - chanslice.stop)
               for nc in nchan]

# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]

# Re-open the exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)

I typed this out without running it, but it should illustrate the idea.

The above is clunky; I'm thinking about general improvements to the process in ratt-ru/dask-ms#86.
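To make the payoff of this strategy concrete, here is a toy dask-array version (no MS required; the shapes and slice bounds are illustrative). Chunking the channel axis as (before, selection, after) means the slice lands exactly on chunk boundaries, so only the middle chunk survives in the graph:

```python
import dask.array as da

chanslice = slice(144, 1024)     # illustrative channel subset
nchan = 4096
chan_chunks = (chanslice.start,
               chanslice.stop - chanslice.start,
               nchan - chanslice.stop)

# Stand-in for a visibility column opened with the chunking above
vis = da.zeros((1000, nchan), chunks=(100, chan_chunks))
sub = vis[:, chanslice]

# The slice falls exactly on chunk boundaries, so only the middle
# chunk of each row-block remains in the graph; against a real MS,
# only those chunks would trigger disk reads.
print(sub.chunks[1])  # (880,)
```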

@o-smirnov o-smirnov self-assigned this May 25, 2020
@o-smirnov o-smirnov added the enhancement New feature or request label May 25, 2020
@o-smirnov
Collaborator Author

Bottom line: I need to adjust the chunking as above for maximum efficiency.

@ratt-ru ratt-ru deleted a comment from codeFairOfficial Mar 9, 2021