what's the best way to apply a channel slice? #109

o-smirnov opened this issue Apr 24, 2020 · 7 comments

@o-smirnov
Contributor

Been asking elsewhere, but it occurs to me it's a general enough question to be usefully asked here....

So, what's the best way to read a subset of channels? I can apply a slice to the array objects of course, but is there a way to do this up front? Or a better way?

@o-smirnov o-smirnov assigned o-smirnov and sjperkins and unassigned o-smirnov Apr 24, 2020
@sjperkins
Member

To do this with maximal efficiency, you'd have to use a pre-processing step to figure out the optimal chunking strategy for the channel dimensions:

# Desired channel selection (example values)
chanslice = slice(37, 56)

# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]
# Read the very small DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION")[0].compute()
# Create a dataset per row of SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")
# Number of channels for each dataset
# (CHAN_FREQ has shape (1, nchan) for per-row datasets)
nchan = [spws[ddid.SPECTRAL_WINDOW_ID.values[d]].CHAN_FREQ.shape[1]
         for d in ddids]
# Channel chunking schema for each dataset: chunk boundaries at the
# slice edges (note Python slices have .start/.stop, not .end)
chan_chunks = [(chanslice.start, chanslice.stop - chanslice.start,
                nc - chanslice.stop)
               for nc in nchan]

# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]

# Re-open exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)

# This should slice the channel selection optimally without dask block overlap
datasets[0].DATA[:, chanslice, :]

I typed this out without running it, but it should illustrate the idea.

The above is clunky; I'm thinking about general improvements to the process in #86.
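The chunk arithmetic in the recipe above can be pulled out into a small pure-Python helper. This is just a sketch of the pre-processing step, not part of the dask-ms API; the name chan_chunking is hypothetical, and it additionally drops zero-sized chunks (e.g. when the slice starts at channel 0), which the inline version above doesn't handle:

```python
def chan_chunking(chanslice, nchan):
    """Return a dask 'chan' chunk tuple whose boundaries align with chanslice."""
    start = chanslice.start or 0
    stop = chanslice.stop if chanslice.stop is not None else nchan
    # Chunks before, inside and after the slice; drop any that are empty
    return tuple(c for c in (start, stop - start, nchan - stop) if c > 0)

# Build a chunking schema per dataset, as in the recipe above
chunks = [{'row': 100000, 'chan': chan_chunking(slice(37, 56), nc)}
          for nc in (64, 64)]
print(chunks[0]['chan'])  # (37, 19, 8)
```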

@Mulan-94
Contributor

@sjperkins, I'm unable to follow properly. I usually just do dataset[0].sel(chan=slice(start, end)), and the chunking for the channel seems to be reconstructed automatically by dask.
Is this the wrong way :( ?

@sjperkins
Member

@sjperkins, I'm unable to follow properly. I usually just do dataset[0].sel(chan=slice(start, end)), and the chunking for the channel seems to be reconstructed automatically by dask.
Is this the wrong way :( ?

It's not wrong, but that way will end up reading more data from the MS than necessary. What the example is trying to demonstrate is that to prevent this behaviour the dask chunking must be correctly set up in xds_from_ms(..., chunks={...}).

I agree that it's not necessarily easy to follow, I simply haven't found a reasonable way of making this easy yet. #86 should probably form the nucleus for dealing with this.
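To illustrate why the un-aligned version reads more than necessary, here is a small plain-Python sketch (no MS required) of the I/O model: a chunk is the unit of reading, so every chunk the slice overlaps is read in full. The chunks_read function is purely illustrative, not a dask-ms function:

```python
def chunks_read(chunks, sl):
    """Elements read to service slice sl over an array with the given chunks."""
    read, offset = 0, 0
    for size in chunks:
        # A chunk is read in full if the slice overlaps it at all
        if sl.start < offset + size and sl.stop > offset:
            read += size
        offset += size
    return read

# Single 64-channel chunk: the full column is read for a 19-channel slice
print(chunks_read((64,), slice(37, 56)))        # 64
# Slice-aligned chunks: only the matching 19-channel chunk is read
print(chunks_read((37, 19, 8), slice(37, 56)))  # 19
```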

@sjperkins
Member

Let's look at this with a concrete example:

nchan = 64
chanslice = slice(37, 56)
chan_chunks = (37, 19, 8)  # i.e. (37 - 0, 56 - 37, 64 - 56)

datasets = xds_from_ms(..., chunks={..., 'chan': chan_chunks})
ds.DATA.data[:, chanslice, :]

That will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will be read from the MS exactly, using getcolslice.

If, by contrast, we don't ask for channel chunking:

datasets = xds_from_ms(...)
ds.DATA.data[:, chanslice, :]

there'll be a single channel chunk for the dask DATA arrays containing the full 64 channels. So there'll be a getcol operation behind the scenes that reads all the channel data, followed by a slice that returns the channel subset.
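Putting numbers on the example above (plain arithmetic, no MS required): the default single 64-channel chunk reads all 64 channels to return 19, so roughly 70% of the channel I/O is discarded, whereas the aligned chunking reads exactly the 19 channels wanted:

```python
nchan = 64
chanslice = slice(37, 56)
wanted = chanslice.stop - chanslice.start  # 19 channels kept

unchunked_read = nchan   # getcol reads every channel
aligned_read = wanted    # getcolslice reads just the 19-channel chunk

print(unchunked_read, aligned_read)    # 64 19
print(1 - wanted / unchunked_read)     # 0.703125 of the read is discarded
```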

@sjperkins
Member

I also want to say it's good that people are pointing this out as a conceptually difficult thing, because I do want to make things easier and reduce the cognitive overhead.

@o-smirnov
Contributor Author

As an aging professor, I can only applaud the sentiment. My cognitive is all overhead!

@Mulan-94
Contributor

That will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will be read from the MS exactly, using getcolslice.

This makes it clearer, thanks :) !
