channel selection does not lead to anticipated speedup for MeerKAT MSs #31

Open · Labels: enhancement (New feature or request)

o-smirnov opened this issue Apr 19, 2020 · 8 comments
@o-smirnov
Collaborator

I've added a channel subset selector, which works in the usual slicing manner:

shadems ms-4k-cal.ms --col CORRECTED_DATA -x FREQ -y amp --corr XX,YY --color-by phase --cmin -2 --cmax 2 -z 10000 --chan 1000:1100
shadems ms-4k-cal.ms --col CORRECTED_DATA -x FREQ -y amp --corr XX,YY --color-by phase --cmin -2 --cmax 2 -z 10000 

The first case plots 100 channels, the second plots all 4096. However, the speedup in the first case is only about 2x, which suggests that not much I/O has been saved.

I apply the channel selection by slicing the DataArray in the group object, using array[dict(chan=chanslice)], which I thought was the prescribed manner to do this (@sjperkins please confirm).
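To illustrate what such a lazy slice does (a minimal sketch using plain dask arrays, with illustrative shapes standing in for the MS columns):

```python
import dask.array as da

# Toy stand-in for the visibility DataArray: each chunk spans the full
# channel axis, mimicking the MS layout discussed below.
nrow, nchan = 1000, 4096
vis = da.zeros((nrow, nchan), chunks=(100, nchan))

chanslice = slice(1000, 1100)
sub = vis[:, chanslice]          # lazy, like array[dict(chan=chanslice)]

# The slice is applied per chunk, but each source chunk still covers
# all 4096 channels, so the full chunk is read and then trimmed.
print(vis.chunks[1])  # (4096,)
print(sub.chunks[1])  # (100,)
```

The slicing itself is cheap; whether any I/O is saved depends on how the underlying chunks map onto disk reads.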

Perhaps the problem is with the DataManager in the MS. Looking at it:

>>> tab.getdminfo('CORRECTED_DATA')
{'TYPE': 'TiledShapeStMan', 'NAME': 'TiledCorrected', 'SEQNR': 24, 'SPEC': {'MaxCacheSize': 0, 'DEFAULTTILESHAPE': array([  4,  60, 546], dtype=int32), 'MAXIMUMCACHESIZE': 0, 'HYPERCUBES': {'*1': {'CubeShape': array([      4,    4096, 1234296], dtype=int32), 'TileShape': array([   4, 4096,    8], dtype=int32), 'CellShape': array([   4, 4096], dtype=int32), 'BucketSize': 1048576, 'ID': {}}}, 'SEQNR': 24, 'IndexSize': 1}}

...it doesn't tile the channel dimension. This means the underlying table system still reads entire rows, which is very inefficient if one only wants a small subset of the channels. In that case, this is a katdal issue to be fixed.
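A quick back-of-the-envelope check on the data manager info above bears this out (assuming complex64 visibilities at 8 bytes each):

```python
# TileShape from the HYPERCUBES entry in getdminfo() above
corr, chan, rows_per_tile = 4, 4096, 8
itemsize = 8  # complex64

bytes_per_tile = corr * chan * rows_per_tile * itemsize
print(bytes_per_tile)  # 1048576, matching the reported BucketSize

# Each tile spans the full 4096-channel axis, so any channel selection,
# however narrow, touches every tile covering the selected rows:
# reading 100 of 4096 channels pulls just as many tiles off disk as
# reading all of them.
```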

@sjperkins please also confirm -- if I'm slicing the array as above, it will eventually result in a getcolslice[np]() call to read the data and not getcol(), correct?

@sjperkins
Member

> I apply the channel selection by slicing the DataArray in the group object, using array[dict(chan=chanslice)], which I thought was the prescribed manner to do this (@sjperkins please confirm).

The following should do it: xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice})

> @sjperkins please also confirm -- if I'm slicing the array as above, it will eventually result in a getcolslicenp() call to read the data and not getcol(), correct?

Yes, but it's set up through xds_from_ms.

Re: katdal, I think channel tiling depends on the application (spectral, continuum, etc.), but let's chat in person.

@o-smirnov
Collaborator Author

BTW, @sjperkins, @JSKenyon, is the recent activity I saw on dask-ms related to getcolslicenp relevant here perhaps?

@o-smirnov
Collaborator Author

> The following should do it: xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice})

@sjperkins are you sure? Surely that's the chunking specification, not the channel subset specification. When I do it that way, it crashes with:

   File "/scratch/oms/projects/dask-ms/daskms/columns.py", line 219, in column_metadata
    dc = da.core.normalize_chunks(dc, shape=(s,))
  File "/home/oms/.venv/sms/lib/python3.6/site-packages/dask/array/core.py", line 2440, in normalize_chunks
    and len(chunks) > 1
TypeError: object of type 'slice' has no len()

Can't find it in the dask-ms docs...

@sjperkins
Member

> > The following should do it: xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice})
>
> @sjperkins are you sure? Surely that's the chunking specification, not the channel subset specification. When I do it that way, it crashes with:
>
>     File "/scratch/oms/projects/dask-ms/daskms/columns.py", line 219, in column_metadata
>       dc = da.core.normalize_chunks(dc, shape=(s,))
>     File "/home/oms/.venv/sms/lib/python3.6/site-packages/dask/array/core.py", line 2440, in normalize_chunks
>       and len(chunks) > 1
>     TypeError: object of type 'slice' has no len()
>
> Can't find it in the dask-ms docs...

Yes, it should be an integer or a tuple specifying the individual channel chunks (which should add up to the full channel range).

So for 60 channels, {'chan': 16} or {'chan': (16, 16, 16, 12)} will achieve the same thing.
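This equivalence can be checked directly with dask's normalize_chunks (the same helper that appears in the traceback above); the integer form is padded with the remainder:

```python
import dask.array as da

# An integer chunk size and an explicit tuple describe the same
# 60-channel axis; dask pads the final chunk with the remainder.
print(da.core.normalize_chunks(16, shape=(60,)))
print(da.core.normalize_chunks((16, 16, 16, 12), shape=(60,)))
# both give ((16, 16, 16, 12),)
```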

@sjperkins
Member

xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice.stop - chanslice.start})

@o-smirnov
Collaborator Author

But where is the actual channel subset specified, if I want one? I've been applying it as a slice to the DataArrays, is that the best way to do it?

@sjperkins
Member

sjperkins commented Apr 24, 2020

> But where is the actual channel subset specified, if I want one? I've been applying it as a slice to the DataArrays, is that the best way to do it?

Apologies, I'm losing the plot in this thread. You can slice the DataArrays, with the caveat that it'll read the entire channel range and then slice the relevant chunk out.

To do this with maximal efficiency, you'd have to run a pre-processing step to figure out the optimal chunking strategy for the channel dimension:

# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]
# Read the very small DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION").compute()
# Create a dataset per row of SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")
# Number of channels for each dataset
nchan = [spws[ddid[i].SPECTRAL_WINDOW_ID[0]].CHAN_FREQ.shape[0] for i in ddids]
# Channel chunking schema for each dataset: (before, selection, after)
# (note: slices use .stop; drop any zero-width edge chunks if the
# selection touches the start or end of the channel axis)
chan_chunks = [(chanslice.start,
                chanslice.stop - chanslice.start,
                nc - chanslice.stop)
               for nc in nchan]

# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]

# Re-open the exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)

I typed this out without running it, but it should illustrate the idea.

The above is clunky; I'm thinking about general improvements to the process in ratt-ru/dask-ms#86.
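To make the payoff of this strategy concrete, here is a toy dask-array version (no MS required; the shapes and slice bounds are illustrative). Chunking the channel axis as (before, selection, after) means the slice lands exactly on chunk boundaries, so only the middle chunk survives in the graph:

```python
import dask.array as da

chanslice = slice(144, 1024)     # illustrative channel subset
nchan = 4096
chan_chunks = (chanslice.start,
               chanslice.stop - chanslice.start,
               nchan - chanslice.stop)

# Stand-in for a visibility column opened with the chunking above
vis = da.zeros((1000, nchan), chunks=(100, chan_chunks))
sub = vis[:, chanslice]

# The slice falls exactly on chunk boundaries, so only the middle
# chunk of each row-block remains in the graph; against a real MS,
# only those chunks would trigger disk reads.
print(sub.chunks[1])  # (880,)
```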

@o-smirnov o-smirnov self-assigned this May 25, 2020
@o-smirnov o-smirnov added the enhancement New feature or request label May 25, 2020
@o-smirnov
Collaborator Author

Bottom line: I need to adjust the chunking as above for maximum efficiency.

@ratt-ru ratt-ru deleted a comment from codeFairOfficial Mar 9, 2021